Network Newton-Part II:
Convergence Rate and Implementation

Aryan Mokhtari, Qing Ling and Alejandro Ribeiro 


Abstract - The use of network Newton methods for the decentralized optimization of a sum cost distributed through agents of a network is considered. Network Newton methods reinterpret distributed gradient descent as a penalty method, observe that the corresponding Hessian is sparse, and approximate the Newton step by truncating a Taylor expansion of the inverse Hessian. Truncating the series at K terms yields the NN-K method, which requires aggregating information from K hops away. Network Newton is introduced and shown to converge to the solution of the penalized objective function at a rate that is at least linear in a companion paper [3]. The contributions of this work are: (i) To complement the convergence analysis by studying the methods' rate of convergence, (ii) To introduce adaptive formulations that converge to the optimal argument of the original objective, (iii) To perform numerical evaluations of NN-K methods. The convergence analysis relates the behavior of NN-K with the behavior of (regular) Newton's method and shows that the method goes through a quadratic convergence phase in a specific interval. The length of this quadratic phase grows with K and can be made arbitrarily large. The numerical experiments corroborate reductions in the number of iterations and the communication cost that are necessary to achieve convergence relative to distributed gradient descent.

Index Terms —Multi-agent network, distributed optimization, 
Newton’s method. 


I. Introduction 

In decentralized optimization problems a group of agents 
is tasked with minimizing a sum cost when each of them 
has access to a specific summand. They do so by working 
through subsequent rounds of local processing and variable 
exchanges with adjacent peers. This architecture arises naturally in decentralized control [4]-[6] as well as in wireless [7], [8] and sensor networks [9]-[11]. In these problems agents have access to local information but want to achieve a common goal, administer a shared resource, or estimate the state of a global environment. Decentralized optimization is also relevant to large scale machine learning where problems are not inherently distributed but are divvied up to process big datasets.

Irrespective of the specific application, various methods have been developed for decentralized optimization. These include distributed gradient descent (DGD) [15]-[18] as well as distributed implementations of the alternating direction method of multipliers [9], [19]-[21] and dual averaging [22], [23]. At the core of all of these methods lies a gradient descent iteration that endows them with their convergence properties, but also results in large convergence times for problems with poor conditioning. In a companion paper, we introduced the network Newton family of decentralized optimization methods that incorporates second order information into DGD iterations to accelerate convergence [3]. Methods in the network Newton family are derived by introducing a penalty formulation of distributed optimization objectives (Section II) for which the resulting Hessians have the same sparsity pattern as the underlying network (Section II-A). The Hessian inverse that is necessary to compute Newton steps is then expressed as a Taylor series expansion that we truncate at K terms to obtain the Kth member of the network Newton family - which we abbreviate as NN-K. These truncations can be computed in a distributed manner by aggregating information from, at most, K hops away.

The network Newton methods have been proven to converge to the optimal solution of the penalized objective at a rate that is at least linear [3]. The main goal of this paper is to complete the convergence analysis of NN-K by studying its rate of convergence (Section III). We show that for all iterations except the first few, a weighted gradient norm associated with NN-K iterates follows a decreasing path akin to the path that would be followed by regular Newton iterates (Lemma 2). The only difference between these residual paths is that the NN-K path contains a term that captures the error of the Hessian inverse approximation. Leveraging this similarity, it is possible to show that the rate of convergence is quadratic in a specific interval whose length depends on the order K of the selected network Newton method (Theorem 2). Existence of this quadratic convergence phase explains why NN-K methods converge faster than DGD - as we indeed observe in numerical analyses. It is also worth remarking that the error in the Hessian inverse approximation can be made arbitrarily small by increasing the method's order K and, as a consequence, the quadratic phase can be made arbitrarily large.

Given that NN-K solves the minimization of a penalized objective, it converges to a point that is close to the optimum but not equal to it. To achieve exact convergence we introduce an adaptive version of NN-K - which we term ANN-K - that uses a sequence of increasing penalty coefficients (Section IV). We wrap up the paper with numerical analyses. We first demonstrate the advantages of NN-K relative to DGD for the minimization of a family of quadratic objective functions with varying condition number and network connectivity (Section V). As expected, NN-K methods reduce convergence times by substantive factors when the objective functions are not well conditioned. Advantages in terms of communication cost are less marked, because NN-K aggregates information from K-hop neighborhoods, but still substantial. Network Newton is also applied to solve a logistic regression problem. The results reinforce the conclusions reached for the quadratic objective problem (Section V-B). Numerical analyses also show that network Newton methods with K = 1 and K = 2 tend to work best when measured in terms of overall communication cost. Numerical experiments for ANN-K illustrate the tradeoffs that appear in the selection of the initial penalty coefficient and its rate of change (Section V-A). The paper closes with concluding remarks (Section VI).


Notation. Vectors are written as $x \in \mathbb{R}^n$ and matrices as $A \in \mathbb{R}^{n \times n}$. Given n vectors $x_i$, the vector $y = [x_1; \ldots; x_n]$ represents a stacking of the elements of each individual $x_i$. The null space of matrix A is denoted by null(A) and the span of a vector by span(x). The i-th eigenvalue of matrix A is denoted by $\mu_i(A)$. For matrices A and B their Kronecker product is denoted as $A \otimes B$. The gradient of a function $f(x)$ is denoted as $\nabla f(x)$ and the Hessian matrix is denoted by $\nabla^2 f(x)$.


II. Algorithm definition

We consider a connected and symmetric network with n agents generically indexed by $i = 1, \ldots, n$. The network is specified by the n neighborhood sets $\mathcal{N}_i$, each of which is defined as the group of nodes that are connected to i. Nodes have access to strongly convex local objective functions $f_i(x)$, but cooperate to minimize the global cost $f(x) := \sum_{i=1}^{n} f_i(x)$,

$$ x^* := \operatorname{argmin}_{x} f(x) = \operatorname{argmin}_{x} \sum_{i=1}^{n} f_i(x). \tag{1} $$

To rewrite the global problem in a form that is suitable for distributed implementation we define local variables $x_i \in \mathbb{R}^p$ and rewrite the cost to be minimized as $\sum_{i=1}^{n} f_i(x_i)$. For the problem formulation to be equivalent to (1), we have to further add the restriction that local variables $x_i$ be the same as neighboring variables $x_j$ with $j \in \mathcal{N}_i$,

$$ \{x_i^*\}_{i=1}^{n} := \operatorname{argmin}_{x_1, \ldots, x_n} \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_i = x_j \text{ for all } i,\ j \in \mathcal{N}_i. \tag{2} $$

The optimization problems in (1) and (2) are equivalent in the sense that $x_i^* = x^*$ for all i. This has to be true because the feasible set of (2) is restricted to configurations in which all variables $x_i$ are equal given that the network is connected.

The constraints $x_i = x_j$ imposed for all i and all $j \in \mathcal{N}_i$ are a way of making all local variables equal but there are other alternatives. The one that is germane to this paper consists of introducing weights $w_{ij}$ that we group in the matrix $W \in \mathbb{R}^{n \times n}$. The weights $w_{ij}$ are chosen so that the matrix W is symmetric, row stochastic, and such that the null space of $I - W$ is the span of the all one vector $\mathbf{1}$,

$$ W^T = W, \quad W\mathbf{1} = \mathbf{1}, \quad \operatorname{null}(I - W) = \operatorname{span}(\mathbf{1}). \tag{3} $$

We further define the extended weight matrix

$$ Z := W \otimes I \in \mathbb{R}^{np \times np} \tag{4} $$

as the Kronecker product of the weight matrix $W \in \mathbb{R}^{n \times n}$ and the identity matrix $I \in \mathbb{R}^{p \times p}$, as well as the vector $y := [x_1; \ldots; x_n]$ as the concatenation of the local vectors $x_i$. It follows that the equality constraint $(I - Z)y = 0$ can be satisfied if and only if all the local variables are equal, i.e., if and only if $x_1 = \cdots = x_n$. Indeed, since the null space of $I - W$ is $\operatorname{null}(I - W) = \operatorname{span}(\mathbf{1})$ as per the last condition in (3), the null space of $I - Z$ must be $\operatorname{null}(I - Z) = \operatorname{span}(\mathbf{1} \otimes I)$. Thus, vectors $y := [x_1; \ldots; x_n]$ in the null space of $I - Z$, which, by definition, are the only ones that satisfy the equality $(I - Z)y = 0$, belong to the span of $\mathbf{1} \otimes I$ and therefore satisfy $x_1 = \cdots = x_n$.

Further observe that the matrix Z, being stochastic and symmetric, has eigenvalues no larger than 1, so that $I - Z$ is positive semidefinite. Consequently, the square root matrix $(I - Z)^{1/2}$ exists and has the same null space as $I - Z$. It then follows that $(I - Z)^{1/2} y = 0$ if and only if the components of y satisfy $x_1 = \cdots = x_n$. In turn, this implies that the optimization problem in (2) is equivalent to

$$ \tilde{y}^* := \operatorname{argmin}_{y} \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } (I - Z)^{1/2} y = 0. \tag{5} $$
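As a hedged illustration of the conditions in (3), the sketch below builds a weight matrix for a small graph and checks symmetry and row stochasticity. The Metropolis-style rule used to pick the weights and the 3-node path graph are assumptions made for this example only; the paper does not prescribe a specific rule. For a connected graph with positive edge weights, $I - W$ is a weighted graph Laplacian, so the null space condition in (3) also holds.

```python
def metropolis_weights(n, edges):
    """Build a symmetric, row-stochastic W (list of lists) for an undirected graph."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0 / (1 + max(degree[i], degree[j]))
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])  # diagonal entry absorbs the remaining row mass
    return W


def is_symmetric_row_stochastic(W, tol=1e-12):
    """Check the first two conditions in (3): W^T = W and W 1 = 1."""
    n = len(W)
    symmetric = all(abs(W[i][j] - W[j][i]) <= tol for i in range(n) for j in range(n))
    stochastic = all(abs(sum(row) - 1.0) <= tol for row in W)
    return symmetric and stochastic


# Hypothetical 3-node path graph 0 - 1 - 2.
W = metropolis_weights(3, [(0, 1), (1, 2)])
print(W)
print(is_symmetric_row_stochastic(W))
```

The diagonal weights produced by this rule are strictly between 0 and 1 whenever every node has at least one neighbor, which is the situation the convergence analysis assumes.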


Here, we solve (5) using a penalized version of the objective function. To do so we consider a given penalty coefficient $1/\alpha$ and the squared norm penalty function $(1/2)\|(I - Z)^{1/2} y\|^2 = (1/2)\, y^T (I - Z) y$ associated with the constraint $(I - Z)^{1/2} y = 0$. With penalty function and coefficient so defined, we can now introduce the penalized objective $F(y) := (1/2)\, y^T (I - Z) y + \alpha \sum_{i=1}^{n} f_i(x_i)$ and the penalized optimization problem

$$ y^* := \operatorname{argmin}_{y} F(y) := \operatorname{argmin}_{y} \frac{1}{2} y^T (I - Z) y + \alpha \sum_{i=1}^{n} f_i(x_i). \tag{6} $$

As the penalty coefficient $1/\alpha$ grows, or, equivalently, as $\alpha$ vanishes, the optimal argument $y^*$ of the penalized problem (6) converges towards the optimal argument $\tilde{y}^*$ of the constrained problems (2) and (5). In that sense, (6) is a reasonable proxy for (5), (2), and the equivalent original formulation in (1).

The property that makes the penalized problem in (6) amenable to distributed implementation is that its gradients can be computed by exchanging information between neighboring nodes. This property is the basis for the development of the DGD method of [15] and the NN method of [3]. In the following section we study the idea of using Newton's method for solving (6).

A. Newton’s method and Hessian splitting 

We proceed to minimize the penalized objective function $F(y)$ in (6) using Newton's method. The Newton update with stepsize $\epsilon$ for function $F(y)$ can be written as

$$ y_{t+1} = y_t - \epsilon\, \nabla^2 F(y_t)^{-1} \nabla F(y_t), \tag{7} $$

where $\nabla^2 F(y_t)$ and $\nabla F(y_t)$ are the Hessian and gradient of function F evaluated at point $y_t$, respectively.

To compute the gradient $\nabla F(y_t)$ we introduce the vector $h(y) := [\nabla f_1(x_1); \ldots; \nabla f_n(x_n)]$ that concatenates the local gradients $\nabla f_i(x_i)$. Given the definition of the penalized function $F(y)$ in (6) it follows that the gradient of $F(y)$ at $y = y_t$ is

$$ g_t := \nabla F(y_t) = (I - Z) y_t + \alpha h(y_t). \tag{8} $$

The computation of the gradient $g_t$ can be distributed through the network because Z has the sparsity pattern of the graph. Specifically, define the local gradient component $g_{i,t}$ as the ith element of the gradient $g_t = [g_{1,t}; \ldots; g_{n,t}]$ and recall that $x_{i,t}$ and $x_{i,t+1}$ are the ith components of the vectors $y_t$ and $y_{t+1}$. The local gradient component at node i is given by

$$ g_{i,t} = (1 - w_{ii})\, x_{i,t} - \sum_{j \in \mathcal{N}_i} w_{ij}\, x_{j,t} + \alpha \nabla f_i(x_{i,t}). \tag{9} $$

Using (9), node i can compute its local gradient using its local iterate $x_{i,t}$, the gradient of the local function $\nabla f_i(x_{i,t})$ and the $x_{j,t}$ iterates of its neighbors $j \in \mathcal{N}_i$.
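The local computation in (9) can be sketched as follows for scalar local variables (p = 1). The weight matrix, the quadratic costs $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$, and all numerical values are hypothetical choices for illustration; only the formula itself comes from the text.

```python
def local_gradient(i, W, x, alpha, grad_f):
    """g_i = (1 - w_ii) x_i - sum_{j in N_i} w_ij x_j + alpha * f_i'(x_i), scalar x_i."""
    # Neighbors of i are the nodes j != i with a nonzero weight w_ij.
    neighbors = [j for j in range(len(W)) if j != i and W[i][j] != 0.0]
    consensus = (1.0 - W[i][i]) * x[i] - sum(W[i][j] * x[j] for j in neighbors)
    return consensus + alpha * grad_f[i](x[i])


# Hypothetical 3-node path graph and quadratic costs f_i(x) = 0.5 * a_i * (x - b_i)^2.
W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
a = [1.0, 2.0, 4.0]
b = [0.0, 1.0, 3.0]
grad_f = [lambda z, ai=ai, bi=bi: ai * (z - bi) for ai, bi in zip(a, b)]
x = [0.5, -1.0, 2.0]
alpha = 0.1

g = [local_gradient(i, W, x, alpha, grad_f) for i in range(3)]
print(g)
```

Each node only reads its own row of W and the iterates of its neighbors, which is what makes the gradient in (8) computable with one round of neighbor exchanges.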

To implement Newton's method as defined in (7) we also need to compute the Hessian $H_t := \nabla^2 F(y_t)$ of the penalized objective. Start by differentiating twice the objective function F in (6) in order to write the Hessian $H_t$ as

$$ H_t := \nabla^2 F(y_t) = I - Z + \alpha G_t, \tag{10} $$

where the matrix $G_t \in \mathbb{R}^{np \times np}$ is a block diagonal matrix formed by blocks $G_{ii,t} \in \mathbb{R}^{p \times p}$ containing the Hessian of the ith local function,

$$ G_{ii,t} = \nabla^2 f_i(x_{i,t}). \tag{11} $$

It follows from (10) and (11) that the Hessian $H_t$ is block sparse with blocks $H_{ij,t} \in \mathbb{R}^{p \times p}$ having the sparsity pattern of Z, which is the sparsity pattern of the graph. The diagonal blocks are of the form $H_{ii,t} = (1 - w_{ii}) I + \alpha \nabla^2 f_i(x_{i,t})$ and the off diagonal blocks are not null only when $j \in \mathcal{N}_i$, in which case $H_{ij,t} = -w_{ij} I$.

Recall that for the Newton update in (7), the Hessian inverse $\nabla^2 F(y_t)^{-1} = H_t^{-1}$ evaluated at $y = y_t$ is required, not the Hessian $H_t$. While the Hessian $H_t$ is sparse, the inverse $H_t^{-1}$ is not necessarily sparse. Therefore, the Hessian inverse is not necessarily computable in a decentralized setting. To overcome this problem we split the diagonal and off diagonal blocks of $H_t$ and rely on the Taylor expansion of the inverse $H_t^{-1}$. To be precise, write $H_t = D_t - B$ where the matrix $D_t$ is defined as

$$ D_t := \alpha G_t + 2\,(I - \operatorname{diag}(Z)) := \alpha G_t + 2\,(I - Z_d). \tag{12} $$

In the second equality we defined $Z_d := \operatorname{diag}(Z)$ for future reference. Observe that the matrix $I - Z_d$ is positive definite because in a connected network the local weights are $w_{ii} < 1$. The block diagonal matrix $G_t$ is also positive definite because the local functions are assumed strongly convex. It follows from these two observations that the matrix $D_t$ is block diagonal and positive definite. Further note that the ith diagonal block $D_{ii,t} \in \mathbb{R}^{p \times p}$ of $D_t$ can be computed and stored by node i as $D_{ii,t} = \alpha \nabla^2 f_i(x_{i,t}) + 2(1 - w_{ii}) I$ using local information only. To have $H_t = D_t - B$ we must define $B := D_t - H_t$. Considering the definitions of $H_t$ and $D_t$ in (10) and (12), respectively, it follows that

$$ B = I - 2 Z_d + Z. \tag{13} $$


Observe that B is independent of time and only depends on the weight matrix Z. As in the case of the Hessian $H_t$, the matrix B is block sparse with blocks $B_{ij} \in \mathbb{R}^{p \times p}$ having the sparsity pattern of Z, which is the sparsity pattern of the graph. Since B is block sparse, node i can compute the diagonal block $B_{ii} = (1 - w_{ii}) I$ and the off diagonal blocks $B_{ij} = w_{ij} I$ using local information about its own weights.

Proceed now to factor $D_t^{1/2}$ from both sides of the splitting relationship to write $H_t = D_t^{1/2} (I - D_t^{-1/2} B D_t^{-1/2}) D_t^{1/2}$. This decomposition implies that the Hessian inverse $H_t^{-1}$ can be computed from the Taylor series expansion $(I - X)^{-1} = \sum_{j=0}^{\infty} X^j$ with $X = D_t^{-1/2} B D_t^{-1/2}$. Therefore, we can write

$$ H_t^{-1} = D_t^{-1/2} \sum_{k=0}^{\infty} \left( D_t^{-1/2} B D_t^{-1/2} \right)^k D_t^{-1/2}. \tag{14} $$

The sum in (14) converges if the absolute values of all the eigenvalues of the matrix $D_t^{-1/2} B D_t^{-1/2}$ are strictly less than 1 - we prove that this is true in Proposition 1. Truncations of this convergent series are utilized to define the family of Network Newton methods in the following section.
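To make the splitting $H_t = D_t - B$ and the series in (14) concrete, the following sketch truncates the series for a small example with scalar local variables (p = 1), where $D_t$ is diagonal and the truncation can be written as $\sum_{k=0}^{K} (D^{-1} B)^k D^{-1}$. The graph, the Hessian values, and $\alpha = 1$ are illustrative assumptions. The printed quantity is the largest entry of $I - \hat{H}^{-1} H$, which should shrink as more terms are kept.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices stored as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def truncated_inverse(D_diag, B, K):
    """Sum_{k=0}^{K} (D^{-1} B)^k D^{-1}; for p = 1 this matches the truncated series."""
    n = len(B)
    D_inv = [[1.0 / D_diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    D_inv_B = matmul(D_inv, B)
    term = D_inv                       # k = 0 term
    total = [row[:] for row in D_inv]
    for _ in range(K):
        term = matmul(D_inv_B, term)   # next term: (D^{-1} B)^k D^{-1}
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def truncation_error(D_diag, B, H, K):
    """Largest entry (in absolute value) of I - H_hat^{-1} H."""
    n = len(H)
    P = matmul(truncated_inverse(D_diag, B, K), H)
    return max(abs((1.0 if i == j else 0.0) - P[i][j]) for i in range(n) for j in range(n))


# Hypothetical data: 3-node path graph, scalar Hessians a_i, alpha = 1.
W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
a = [1.0, 2.0, 4.0]
alpha = 1.0
n = len(W)
D_diag = [alpha * a[i] + 2 * (1 - W[i][i]) for i in range(n)]                      # D as in (12)
B = [[(1 - W[i][i]) if i == j else W[i][j] for j in range(n)] for i in range(n)]   # B as in (13)
H = [[(D_diag[i] if i == j else 0.0) - B[i][j] for j in range(n)] for i in range(n)]

errors = [truncation_error(D_diag, B, H, K) for K in (0, 1, 5, 20)]
print(errors)
```

A dense matrix product is used here only to inspect the approximation; a distributed implementation never forms these matrices.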


B. Network Newton

Network Newton is defined as a family of algorithms that rely on truncations of the series in (14). The Kth member of this family, NN-K, considers the first K + 1 terms of the series to define the approximate Hessian inverse

$$ \hat{H}_t^{(K)^{-1}} := D_t^{-1/2} \sum_{k=0}^{K} \left( D_t^{-1/2} B D_t^{-1/2} \right)^k D_t^{-1/2}. \tag{15} $$

NN-K uses the approximate Hessian inverse $\hat{H}_t^{(K)^{-1}}$ as a curvature correction matrix that is used in lieu of the exact Hessian inverse $H_t^{-1}$ to estimate the Newton step. I.e., instead of descending along the Newton step $d_t := -H_t^{-1} g_t$ we descend along the NN-K step $d_t^{(K)} := -\hat{H}_t^{(K)^{-1}} g_t$, which we intend as an approximation of $d_t$. Using the explicit expression for $\hat{H}_t^{(K)^{-1}}$ in (15) we write the NN-K step as

$$ d_t^{(K)} = - D_t^{-1/2} \sum_{k=0}^{K} \left( D_t^{-1/2} B D_t^{-1/2} \right)^k D_t^{-1/2}\, g_t, \tag{16} $$

where, we recall, the vector $g_t$ is the gradient of the objective function $F(y)$ defined in (8). The NN-K update formula can then be written as

$$ y_{t+1} = y_t + \epsilon\, d_t^{(K)}. \tag{17} $$

The algorithm defined by recursive application of (17) can be implemented in a distributed manner. Specifically, define the components $d_{i,t}^{(K)} \in \mathbb{R}^p$ of the NN-K step $d_t^{(K)} = [d_{1,t}^{(K)}; \ldots; d_{n,t}^{(K)}]$ and rewrite (17) componentwise as

$$ x_{i,t+1} = x_{i,t} + \epsilon\, d_{i,t}^{(K)}. \tag{18} $$

To determine the step components $d_{i,t}^{(K)}$ in (18) observe that, considering the definition of the NN-K descent direction in (16), Network Newton descent directions can be computed by the recursive expression

$$ d_t^{(k+1)} = D_t^{-1} B\, d_t^{(k)} - D_t^{-1} g_t = D_t^{-1} \left( B\, d_t^{(k)} - g_t \right). \tag{19} $$

If we expand the product $B\, d_t^{(k)}$ as a sum and utilize the fact that the blocks of the matrix B have the sparsity pattern of the graph, we can separate (19) into the following componentwise recursions

$$ d_{i,t}^{(k+1)} = D_{ii,t}^{-1} \Big( \sum_{j \in \mathcal{N}_i,\, j = i} B_{ij}\, d_{j,t}^{(k)} - g_{i,t} \Big). \tag{20} $$


That the matrix B is block-sparse permits writing the sum in (20) as a sum over neighbors, instead of a sum across all nodes.

In (20), the matrix blocks $D_{ii,t} = \alpha \nabla^2 f_i(x_{i,t}) + 2(1 - w_{ii}) I$, $B_{ii} = (1 - w_{ii}) I$, and $B_{ij} = w_{ij} I$ are evaluated and stored at node i. The gradient component $g_{i,t}$ is also stored and computed at i upon being communicated the values of neighboring iterates $x_{j,t}$ [cf. (9)]. Thus, if the NN-k step components $d_{j,t}^{(k)}$ are available at neighboring nodes j, node i can determine the NN-(k + 1) step component $d_{i,t}^{(k+1)}$ upon being communicated that information. We use this property to embed an iterative computation of the NN-K step inside the NN-K recursion in (18). For each iteration index t, we compute the local component of the NN-0 step $d_{i,t}^{(0)} = -D_{ii,t}^{-1} g_{i,t}$. Upon exchanging this information with neighbors we use (20) to determine the NN-1 step components $d_{i,t}^{(1)}$. These can be exchanged and plugged in (20) to compute $d_{i,t}^{(2)}$. Repeating this procedure K times, nodes end up having determined their NN-K step component $d_{i,t}^{(K)}$. They use this step to update $x_{i,t}$ according to (18) and move to the next iteration. We analyze the convergence rate of the resulting algorithm in Section III and develop a numerical analysis in Section V.
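The procedure just described, an NN-0 step followed by K rounds of neighbor exchanges, can be sketched for scalar local variables (p = 1) as below. This is a centralized simulation written for illustration; the function name `nn_direction`, the graph, and the quadratic costs are assumptions and not part of the paper. For large K the returned direction should approach the exact Newton direction, which the last lines check by evaluating the residual $H d + g$.

```python
def nn_direction(W, x, grad_f, hess_f, alpha, K):
    """Return (d, g): the NN-K direction and the gradient, one scalar per node."""
    n = len(W)
    # Local gradient: every node only uses its own row of W.
    g = [(1 - W[i][i]) * x[i]
         - sum(W[i][j] * x[j] for j in range(n) if j != i)
         + alpha * grad_f[i](x[i]) for i in range(n)]
    # Local diagonal block D_ii = alpha * f_i''(x_i) + 2 (1 - w_ii).
    D = [alpha * hess_f[i](x[i]) + 2 * (1 - W[i][i]) for i in range(n)]
    d = [-g[i] / D[i] for i in range(n)]                 # NN-0 step
    for _ in range(K):                                    # one neighbor exchange per pass
        Bd = [(1 - W[i][i]) * d[i]
              + sum(W[i][j] * d[j] for j in range(n) if j != i) for i in range(n)]
        d = [(Bd[i] - g[i]) / D[i] for i in range(n)]    # NN-(k+1) from NN-k
    return d, g


# Hypothetical data: 3-node path graph, costs f_i(x) = 0.5 * a_i * (x - b_i)^2.
W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
a = [1.0, 2.0, 4.0]
b = [0.0, 1.0, 3.0]
grad_f = [lambda z, ai=ai, bi=bi: ai * (z - bi) for ai, bi in zip(a, b)]
hess_f = [lambda z, ai=ai: ai for ai in a]
x = [0.5, -1.0, 2.0]
alpha = 1.0

d, g = nn_direction(W, x, grad_f, hess_f, alpha, K=30)
# Residual of the Newton system H d = -g, with H = I - W + alpha * diag(a).
newton_residual = [(1 - W[i][i] + alpha * a[i]) * d[i]
                   - sum(W[i][j] * d[j] for j in range(3) if j != i) + g[i]
                   for i in range(3)]
print(max(abs(r) for r in newton_residual))
```

Each pass of the inner loop corresponds to one exchange of direction components with neighbors, so the communication per iteration grows linearly with K.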


Remark 1 By trying to approximate the Newton step, NN-K ends up reducing the number of iterations required for convergence. Furthermore, the larger K is, the closer that the NN-K step gets to the Newton step, and the faster NN-K converges. We will justify these assertions both analytically in Section III and numerically in Section V. It is important to observe, however, that reducing the number of iterations reduces the computational cost but not necessarily the communication cost. In DGD, each node i shares its vector $x_{i,t} \in \mathbb{R}^p$ with each of its neighbors $j \in \mathcal{N}_i$. In NN-K, node i exchanges not only the vector $x_{i,t} \in \mathbb{R}^p$ with its neighboring nodes, but it also communicates iteratively the local components of the descent directions $\{d_{i,t}^{(k)}\}_{k=0}^{K-1} \in \mathbb{R}^p$ so as to compute the descent direction $d_{i,t}^{(K)}$. Therefore, at each iteration, node i sends $|\mathcal{N}_i|$ vectors of size p to the neighboring nodes in DGD, while in NN-K it sends $(K + 1)|\mathcal{N}_i|$ vectors of the same size. Unless the original problem is well conditioned, NN-K also reduces total communication cost until convergence, even though the cost of each individual iteration is larger. However, the use of large K is unwarranted because the added benefit of better approximating the Newton step does not compensate the increase in communication cost.
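The per-iteration counts in Remark 1 translate into a simple break-even rule: NN-K uses less total communication than DGD whenever its iteration count, multiplied by K + 1, is below the iteration count of DGD. The helper names below are hypothetical, and the numbers in the usage lines are placeholders, not measurements from the paper.

```python
def vectors_per_iteration(num_neighbors, K=None):
    """Vectors of size p sent by one node per iteration: DGD if K is None, else NN-K."""
    if K is None:
        return num_neighbors
    return (K + 1) * num_neighbors


def nn_is_cheaper(iterations_dgd, iterations_nn, K):
    """True when NN-K needs fewer transmitted vectors in total than DGD."""
    return iterations_nn * (K + 1) < iterations_dgd


# Placeholder numbers for illustration only.
print(vectors_per_iteration(4))         # DGD with 4 neighbors
print(vectors_per_iteration(4, K=2))    # NN-2 with 4 neighbors
print(nn_is_cheaper(iterations_dgd=1000, iterations_nn=100, K=2))
```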


III. Convergence Rate 

Linear convergence of the suboptimality sequence $F(y_t) - F(y^*)$ associated with NN-K iterates $y_t$ has been proven in [3]. We improve this result by showing that regardless of the choice of K, the rate of convergence is quadratic in a specific interval. To prove this result we utilize some of the results in [3] which we repeat here for completeness. We start by restating assumptions that are necessary for the convergence analysis.

Assumption 1 There exist constants $0 \le \delta \le \Delta < 1$ that lower and upper bound the diagonal weights for all i,

$$ 0 \le \delta \le w_{ii} \le \Delta < 1, \quad i = 1, \ldots, n. \tag{21} $$

Assumption 2 The local objective functions $f_i(x)$ are twice differentiable and the eigenvalues of the local objective function Hessians are bounded with positive constants $0 < m \le M < \infty$, i.e.

$$ m I \preceq \nabla^2 f_i(x) \preceq M I. \tag{22} $$

Assumption 3 The local objective function Hessians $\nabla^2 f_i(x)$ are Lipschitz continuous with respect to the Euclidean norm with parameter L. I.e., for all $x, \hat{x} \in \mathbb{R}^p$, it holds

$$ \| \nabla^2 f_i(x) - \nabla^2 f_i(\hat{x}) \| \le L \| x - \hat{x} \|. \tag{23} $$

The upper bound $\Delta < 1$ on the local weights $w_{ii}$ in Assumption 1 exists for connected networks. The non-negative lower bound $\delta$ on the local weights $w_{ii}$ is more a definition than a constraint since we may have $\delta = 0$. Strong convexity of the local objective functions $f_i$ enforces the existence of a lower bound m for the eigenvalues of the local Hessians $\nabla^2 f_i$ as in (22). The upper bound M for the eigenvalues of local objective function Hessians $\nabla^2 f_i(x)$ in Assumption 2 is equivalent to the assumption that local gradients $\nabla f_i(x)$ are Lipschitz continuous with parameter M. Assumption 3 states that the local objective function Hessians are Lipschitz continuous with parameter L. A particular consequence of this assumption is that the penalized objective function Hessian $H(y) := \nabla^2 F(y)$ is also Lipschitz continuous with parameter $\alpha L$ - see Lemma 1 of [3]. I.e., for all $y, \hat{y} \in \mathbb{R}^{np}$, it holds

$$ \| H(y) - H(\hat{y}) \| \le \alpha L \| y - \hat{y} \|. \tag{24} $$

Recall that the block diagonal matrix $D_t$, being the sum of the positive definite matrices $\alpha G_t$ and $2(I - Z_d)$, is positive definite and, therefore, invertible. Further recall that the matrix B, being symmetric and diagonally dominant with nonnegative diagonal entries (each diagonal entry $1 - w_{ii}$ equals the sum of the off diagonal weights in its row), is positive semidefinite. These facts can be used to prove that the eigenvalues of the matrix $D_t^{-1/2} B D_t^{-1/2}$ must be nonnegative and strictly smaller than 1 as we state next [3, Proposition 2].

Proposition 1 Consider the definitions of matrices $D_t$ in (12) and B in (13). If Assumptions 1 and 2 hold true, the matrix $D_t^{-1/2} B D_t^{-1/2}$ is positive semidefinite and the eigenvalues are bounded above by a constant $\rho < 1$,

$$ 0 \preceq D_t^{-1/2} B D_t^{-1/2} \preceq \rho I, \tag{25} $$

where $\rho := 2(1 - \delta) / (2(1 - \delta) + \alpha m)$.
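Since the truncation error of the series is governed by powers of $\rho$, it can be useful to evaluate $\rho$ and the smallest K for which $\rho^{K+1}$ falls below a chosen level. The sketch below does this directly from the definition in Proposition 1; the function names and the numerical values are hypothetical.

```python
def rho(delta, alpha, m):
    """Upper bound on the eigenvalues of D^{-1/2} B D^{-1/2} from Proposition 1."""
    return 2 * (1 - delta) / (2 * (1 - delta) + alpha * m)


def smallest_order(delta, alpha, m, target):
    """Smallest K such that rho^(K+1) <= target, for 0 < target < 1."""
    r = rho(delta, alpha, m)
    K = 0
    while r ** (K + 1) > target:
        K += 1
    return K


# Illustrative values: delta = 1/3, alpha = 1, m = 1.
print(rho(1/3, 1.0, 1.0))
print(smallest_order(1/3, 1.0, 1.0, target=0.01))
```

Note how $\rho$ approaches 1 as $\alpha m$ shrinks, so small penalty parameters call for more series terms to reach the same accuracy.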
The result in Proposition 1 makes the expansion in (14) valid and is used in subsequent proofs. These proofs also rely on guarantees that the eigenvalues of the approximate Hessian inverse $\hat{H}_t^{(K)^{-1}}$ are positive and finite for all choices of K and for all steps t. We state these guarantees next [3, Lemma 2].

Lemma 1 Consider the NN-K method with the update in (17), the gradient $g_t$ as defined in (8), and the matrices B and $D_t$ defined as in (11)-(13). If Assumptions 1 and 2 hold true, the eigenvalues of the approximate Hessian inverse $\hat{H}_t^{(K)^{-1}}$ are bounded as

$$ \lambda I \preceq \hat{H}_t^{(K)^{-1}} \preceq \Lambda I, \tag{26} $$

where constants $\lambda$ and $\Lambda$ are defined as

$$ \lambda := \frac{1}{2(1 - \delta) + \alpha M}, \qquad \Lambda := \frac{1 - \rho^{K+1}}{(1 - \rho)(2(1 - \Delta) + \alpha m)}. \tag{27} $$

The lower bound $\lambda > 0$ for the eigenvalues of the approximate Hessian inverse $\hat{H}_t^{(K)^{-1}}$ guarantees decrement in each network Newton iteration. The upper bound $\Lambda < \infty$ ensures that the norm of the network Newton step $\| \hat{H}_t^{(K)^{-1}} g_t \|$ is bounded by a factor proportional to the gradient norm $\| g_t \|$. Both of these results are necessary to show that the network Newton direction $-\hat{H}_t^{(K)^{-1}} g_t$ is a descent direction. This is claimed to be true in the following theorem [3, Theorem 1].
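As a hedged numerical illustration of Lemma 1, the sketch below evaluates $\lambda = 1/(2(1-\delta)+\alpha M)$ and $\Lambda = (1-\rho^{K+1})/((1-\rho)(2(1-\Delta)+\alpha m))$ and compares them with Rayleigh quotients $v^T \hat{H}^{-1} v / v^T v$ of the approximate Hessian inverse on random vectors. Scalar local variables, the graph, and the Hessian values are assumptions for this example; a few random vectors do not prove the bounds, they only show the example is consistent with them.

```python
import random


def lemma1_bounds(delta, Delta, alpha, m, M, K):
    """Return (lambda, Lambda) as defined in Lemma 1."""
    r = 2 * (1 - delta) / (2 * (1 - delta) + alpha * m)
    lam = 1.0 / (2 * (1 - delta) + alpha * M)
    Lam = (1 - r ** (K + 1)) / ((1 - r) * (2 * (1 - Delta) + alpha * m))
    return lam, Lam


def apply_truncated_inverse(W, a, alpha, K, v):
    """Return H_hat^{-1} v for scalar local variables via u <- D^{-1} (B u + v)."""
    n = len(W)
    D = [alpha * a[i] + 2 * (1 - W[i][i]) for i in range(n)]
    u = [v[i] / D[i] for i in range(n)]
    for _ in range(K):
        Bu = [(1 - W[i][i]) * u[i] + sum(W[i][j] * u[j] for j in range(n) if j != i)
              for i in range(n)]
        u = [(Bu[i] + v[i]) / D[i] for i in range(n)]
    return u


# Hypothetical data: 3-node path graph, scalar Hessians a_i, alpha = 1, K = 2.
W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
a = [1.0, 2.0, 4.0]
alpha, K = 1.0, 2
diag = [W[i][i] for i in range(3)]
lam, Lam = lemma1_bounds(min(diag), max(diag), alpha, min(a), max(a), K)

random.seed(0)
quotients = []
for _ in range(20):
    v = [random.uniform(-1.0, 1.0) for _ in range(3)]
    u = apply_truncated_inverse(W, a, alpha, K, v)
    quotients.append(sum(vi * ui for vi, ui in zip(v, u)) / sum(vi * vi for vi in v))
print(lam, min(quotients), max(quotients), Lam)
```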


Theorem 1 Consider the objective function $F(y)$ as introduced in (6) and the NN-K method with the update in (17), the gradient $g_t$ as defined in (8), and the matrices B and $D_t$ defined as in (11)-(13). If the stepsize $\epsilon$ is chosen as

$$ \epsilon = \min \left\{ 1, \left[ \frac{3 m \lambda^{5/2}}{L \Lambda^3 (F(y_0) - F(y^*))^{1/2}} \right]^{1/2} \right\} \tag{28} $$

and Assumptions 1, 2, and 3 hold true, the sequence $F(y_t)$ converges to the optimal value $F(y^*)$ at least linearly with constant $1 - \zeta$. I.e.,

$$ F(y_t) - F(y^*) \le (1 - \zeta)^t \left( F(y_0) - F(y^*) \right), \tag{29} $$

where the constant $0 < \zeta < 1$ is explicitly given by

$$ \zeta := (2 - \epsilon)\, \epsilon\, \alpha m \lambda - \frac{\alpha \epsilon^3 L \Lambda^3 (F(y_0) - F(y^*))^{1/2}}{6 \lambda^{3/2}}. \tag{30} $$

Theorem 1 establishes linear convergence of the sequence of penalized objective functions $F(y_t)$ generated by NN-K to the optimal objective function value $F(y^*)$ - which implies convergence of $y_t$ to the optimal argument $y^*$. This result is identical to the convergence behavior of DGD as shown in, e.g., [17]. We expect to observe faster convergence for NN-K relative to DGD, since NN-K uses an approximation of the curvature of the penalized objective function F. In the following section we show that this expectation is fulfilled and that NN-K has a quadratic convergence phase regardless of the choice of K.


A. Quadratic convergence phase

To characterize the convergence rate of NN-K, we first study the difference between this algorithm and (exact) Newton's method. In particular, the following lemma shows that the convergence of the norm of the weighted gradient $\| D_{t-1}^{-1/2} g_t \|$ in NN-K is akin to the convergence of Newton's method with constant stepsize. The difference is the appearance of a term associated with the error of the Hessian inverse approximation as we formally state next.

Lemma 2 Consider the NN-K method with the update in (17), the gradient $g_t$ as defined in (8), and the matrices B and $D_t$ defined as in (11)-(13). If Assumptions 1, 2, and 3 hold true, the sequence of weighted gradients $D_t^{-1/2} g_{t+1}$ satisfies

$$ \| D_t^{-1/2} g_{t+1} \| \le \left( 1 - \epsilon + \epsilon \rho^{K+1} \right) \left[ 1 + \Gamma_1 (1 - \zeta)^{(t-1)/4} \right] \| D_{t-1}^{-1/2} g_t \| + \epsilon^2 \Gamma_2 \| D_{t-1}^{-1/2} g_t \|^2, \tag{31} $$

where the constants $\Gamma_1$ and $\Gamma_2$ are defined as

$$ \Gamma_1 := \frac{(\alpha \epsilon L \Lambda)^{1/2} (F(y_0) - F(y^*))^{1/4}}{\lambda^{3/4} (2(1 - \Delta) + \alpha m)^{1/2}}, \qquad \Gamma_2 := \frac{\alpha L \Lambda^2}{2 \lambda (2(1 - \Delta) + \alpha m)^{1/2}}. \tag{32} $$

Proof: See Appendix A. ■

As per Lemma 2 the weighted gradient norm $\| D_t^{-1/2} g_{t+1} \|$ is upper bounded by terms that are linear and quadratic on the weighted norm $\| D_{t-1}^{-1/2} g_t \|$ associated with the previous iterate. This is akin to the gradient norm decrease of Newton's method with constant stepsize. To make this connection clearer, further note that for all except the first few iterations the term $\Gamma_1 (1 - \zeta)^{(t-1)/4}$ is close to 0 and the relation in (31) can be simplified to

$$ \| D_t^{-1/2} g_{t+1} \| \le \left( 1 - \epsilon + \epsilon \rho^{K+1} \right) \| D_{t-1}^{-1/2} g_t \| + \epsilon^2 \Gamma_2 \| D_{t-1}^{-1/2} g_t \|^2. \tag{33} $$

In (33), the coefficient in the linear term is reduced to $(1 - \epsilon + \epsilon \rho^{K+1})$ and the coefficient in the quadratic term stays at $\epsilon^2 \Gamma_2$. If, for discussion purposes, we set $\epsilon = 1$ as in Newton's quadratic phase, we see that the upper bound in (33) is further reduced to

$$ \| D_t^{-1/2} g_{t+1} \| \le \rho^{K+1} \| D_{t-1}^{-1/2} g_t \| + \Gamma_2 \| D_{t-1}^{-1/2} g_t \|^2. \tag{34} $$

We do not obtain quadratic convergence as in Newton's method because of the term $\rho^{K+1} \| D_{t-1}^{-1/2} g_t \|$. However, since the constant $\rho$ (cf. Proposition 1) is smaller than 1, the term $\rho^{K+1}$ can be made arbitrarily small by increasing the approximation order K. Equivalently, this means that by selecting K to be large enough, we can make the quadratic term in (34) dominant and observe a quadratic convergence phase. The boundaries of this quadratic convergence phase are formally determined in the following Theorem.

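Theorem 2 Consider the NN-K method with the update in (17), the gradient $g_t$ as defined in (8), and the matrices B and $D_t$ defined as in (11)-(13). Define the sequence $\eta_t := (1 - \epsilon + \epsilon \rho^{K+1}) (1 + \Gamma_1 (1 - \zeta)^{(t-1)/4})$ and the time $t_0$ as the first time at which sequence $\eta_t$ is smaller than 1, i.e. $t_0 := \operatorname{argmin}_t \{ t \mid \eta_t < 1 \}$. If Assumptions 1, 2, and 3 hold true, for all $t \ge t_0$ when the sequence $\| D_{t-1}^{-1/2} g_t \|$ satisfies

$$ \frac{\sqrt{\eta_t}\, (1 - \sqrt{\eta_t})}{\epsilon^2 \Gamma_2} < \| D_{t-1}^{-1/2} g_t \| < \frac{1 - \sqrt{\eta_t}}{\epsilon^2 \Gamma_2}, \tag{35} $$

the rate of convergence is quadratic in the sense that

$$ \| D_t^{-1/2} g_{t+1} \| \le \frac{\epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \|^2. \tag{36} $$

Proof: Considering the definition of $\eta_t$ we can rewrite the result of Lemma 2 as

$$ \| D_t^{-1/2} g_{t+1} \| \le \eta_t \| D_{t-1}^{-1/2} g_t \| + \epsilon^2 \Gamma_2 \| D_{t-1}^{-1/2} g_t \|^2. \tag{37} $$

We use this expression to prove that the inequality in (36) holds true. To do so rearrange terms in the first inequality in (35) and write

$$ \sqrt{\eta_t} < \frac{\epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \|. \tag{38} $$

Multiplying both sides of (38) by $\sqrt{\eta_t}\, \| D_{t-1}^{-1/2} g_t \|$ yields

$$ \eta_t \| D_{t-1}^{-1/2} g_t \| < \frac{\sqrt{\eta_t}\, \epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \|^2. \tag{39} $$

Substituting $\eta_t \| D_{t-1}^{-1/2} g_t \|$ in (37) for its upper bound in (39) implies that

$$ \| D_t^{-1/2} g_{t+1} \| \le \frac{\sqrt{\eta_t}\, \epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \|^2 + \epsilon^2 \Gamma_2 \| D_{t-1}^{-1/2} g_t \|^2 = \frac{\epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \|^2. \tag{40} $$

To verify quadratic convergence, it is necessary to prove that the sequence $\| D_{t-1}^{-1/2} g_t \|$ of weighted gradient norms is decreasing. For this to be true we must have

$$ \frac{\epsilon^2 \Gamma_2}{1 - \sqrt{\eta_t}} \| D_{t-1}^{-1/2} g_t \| < 1. \tag{41} $$

But (41) is true because we are looking at a range of gradients that satisfy the second inequality in (35).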

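Algorithm 1 Network Newton-K method at node i

1: function $x_i$ = NN-K($\alpha$, $x_i$, tol)
2: repeat
3:    B matrix blocks: $B_{ii} = (1 - w_{ii}) I$ and $B_{ij} = w_{ij} I$
4:    D matrix block: $D_{ii} = \alpha \nabla^2 f_i(x_i) + 2(1 - w_{ii}) I$
5:    Exchange iterates $x_i$ with neighbors $j \in \mathcal{N}_i$.
6:    Gradient: $g_i = (1 - w_{ii}) x_i - \sum_{j \in \mathcal{N}_i} w_{ij} x_j + \alpha \nabla f_i(x_i)$.
7:    Compute NN-0 descent direction $d_i^{(0)} = -D_{ii}^{-1} g_i$
8:    for $k = 0, \ldots, K - 1$ do
9:       Exchange elements $d_i^{(k)}$ of the NN-k step with neighbors
10:      NN-(k + 1) step: $d_i^{(k+1)} = D_{ii}^{-1} \big[ \sum_{j \in \mathcal{N}_i,\, j = i} B_{ij} d_j^{(k)} - g_i \big]$
11:   end for
12:   Update local iterate: $x_i = x_i + \epsilon\, d_i^{(K)}$.
13: until $\| g_i \| <$ tol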


As per Theorem 1, $y_t$ is converging to $y^*$ at a rate that is at least linear. Thus, the gradients $g_t$ will be such that at some point in time they satisfy the rightmost inequality in (35). At that point in time, progress towards $y^*$ proceeds at a quadratic rate as indicated by (36). This quadratic rate of progress is maintained until the leftmost inequality in (35) ceases to hold, at which point the linear term in (31) dominates and the convergence rate goes back to linear. We emphasize that the quadratic convergence region is nonempty because we have $\sqrt{\eta_t} < 1$ for all $t \ge t_0$. Furthermore, making $\epsilon = 1$ and K sufficiently large it is possible to reduce $\eta_t$ arbitrarily and make the quadratic convergence region last longer. In practice, this calls for making K large enough so that $\sqrt{\eta_t}$ is close to the desired gradient norm accuracy.

Remark 2 Making $\rho^{K+1}$ small reduces the factor in front of the linear term in (34) and makes the quadratic phase longer. This factor, as it follows from the definition in Proposition 1, is $\rho^{K+1} = [2(1 - \delta) / (2(1 - \delta) + \alpha m)]^{K+1}$. Thus, other than increasing K, we can make $\rho^{K+1}$ small by increasing the product $\alpha m$. That implies making the inverse penalty coefficient $\alpha$ large relative to the smallest Hessian eigenvalue of the local functions $f_i$ [cf. (22)]. This is not possible if we want to keep the solution $y^*$ of (6) close to the solution $\tilde{y}^*$ of (5). This calls for the use of adaptive rules to decrease the inverse penalty coefficient $\alpha$ as we elaborate in Section IV. Further observe that $\rho$ is independent of the condition number M/m of the local objectives. Making $\rho^{K+1}$ small is an algorithmic choice - controlled by the selection of $\alpha$ and K -, and not a property of the function being minimized.

Remark 3 For a quadratic objective function F, the Lipschitz constant for the Hessian is L = 0. Then, the optimal choice of stepsize for NN-K is $\epsilon = 1$ as a result of the stepsize rule in (28). Moreover, the constants for the linear and quadratic terms in (31) are $\Gamma_1 = \Gamma_2 = 0$ as it follows from their definitions in (32). For quadratic functions we also have that the Hessian of the objective function $H_t = H$ and the block diagonal matrix $D_t = D$ are time invariant, which implies that we can rewrite (31) as

$$ \| D^{-1/2} g_{t+1} \| \le \rho^{K+1} \| D^{-1/2} g_t \|. \tag{42} $$

We know that when applying Newton's method to quadratic functions we converge in a single step. This property follows from (42) because Newton's method is equivalent to NN-K as $K \to \infty$. The expression in (42) states that NN-K converges linearly with a constant decrease factor of $\rho^{K+1}$ per iteration. This factor is independent of the condition number of the quadratic function; see Remark 2. This is in contrast with first order methods like DGD that converge with a linear rate that depends on the condition number of the objective.
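The contraction described for quadratic objectives can be checked numerically. The sketch below runs a few NN-K iterations with unit stepsize on hypothetical scalar quadratics $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$ and compares the per-iteration ratio of weighted gradient norms $\|D^{-1/2} g\|$ against $\rho^{K+1}$. The graph, costs, and $\alpha = 1$ are assumptions for illustration.

```python
import math


def gradient(W, x, a, b, alpha):
    """Gradient of the penalized objective for scalar quadratics."""
    n = len(W)
    return [(1 - W[i][i]) * x[i] - sum(W[i][j] * x[j] for j in range(n) if j != i)
            + alpha * a[i] * (x[i] - b[i]) for i in range(n)]


def weighted_norm(W, x, a, b, alpha):
    """Norm of D^{-1/2} g, with D_ii = alpha * a_i + 2 (1 - w_ii)."""
    g = gradient(W, x, a, b, alpha)
    return math.sqrt(sum(g[i] ** 2 / (alpha * a[i] + 2 * (1 - W[i][i])) for i in range(len(W))))


def nn_k_step(W, x, a, b, alpha, K):
    """One NN-K iteration with stepsize 1."""
    n = len(W)
    g = gradient(W, x, a, b, alpha)
    D = [alpha * a[i] + 2 * (1 - W[i][i]) for i in range(n)]
    d = [-g[i] / D[i] for i in range(n)]
    for _ in range(K):
        Bd = [(1 - W[i][i]) * d[i] + sum(W[i][j] * d[j] for j in range(n) if j != i)
              for i in range(n)]
        d = [(Bd[i] - g[i]) / D[i] for i in range(n)]
    return [x[i] + d[i] for i in range(n)]


# Hypothetical data: 3-node path graph, alpha = 1, K = 1.
W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
a, b = [1.0, 2.0, 4.0], [0.0, 1.0, 3.0]
alpha, K = 1.0, 1
delta, m = min(W[i][i] for i in range(3)), min(a)
bound = (2 * (1 - delta) / (2 * (1 - delta) + alpha * m)) ** (K + 1)

x = [0.5, -1.0, 2.0]
ratios = []
for _ in range(5):
    before = weighted_norm(W, x, a, b, alpha)
    x = nn_k_step(W, x, a, b, alpha, K)
    ratios.append(weighted_norm(W, x, a, b, alpha) / before)
print(bound, ratios)
```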

IV. Implementation issues 

As mentioned in Section II, NN-$K$ does not solve the original problem or its equivalent reformulation, but rather the penalized version of it. The optimal solutions of the original and penalized problems are different and the gap between them is of order $O(\alpha)$ [17]. This observation implies that by setting a decreasing policy
Algorithm 2 Adaptive Network Newton-$K$ method at node $i$
Require: Iterate $x_i$. Initial parameter $\alpha$. Flags $s_{ij} = 0$. Factor $\eta < 1$.

1: for $t = 0, 1, 2, \ldots$ do
2: Call NN-$K$ function: $x_i$ = NN-$K$($\alpha$, $x_i$, tol)
3: Set $s_{ii} = 1$ and broadcast it to all nodes.
4: Set $s_{ij} = 1$ for all nodes $j$ that sent the signal $s_{jj} = 1$.
5: if $s_{ij} = 1$ for all $j = 1, \ldots, n$ then
6: Update penalty parameter $\alpha = \eta\alpha$.
7: Set $s_{ij} = 0$ for all $j = 1, \ldots, n$.
8: end if
9: end for


for $\alpha$, or equivalently, an increasing policy for the penalty coefficient $1/\alpha$, the solution of the penalized problem approaches the minimizer of the original problem, i.e., $y^* \to \tilde{y}^*$ as $\alpha \to 0$.

There are various possible alternatives to reduce $\alpha$. Given the penalty method interpretation, it is more natural to consider fixed penalty parameters $\alpha$ that are decreased after detecting convergence to the optimal argument of the penalized objective $F(y)$. This latter idea is summarized under the name of Adaptive Network Newton-$K$ (ANN-$K$) in Algorithm 2, where $\alpha$ is reduced by a given factor $\eta < 1$.

Specifically, ANN-$K$ relies on Algorithm 1, which receives an initial iterate $x_i$, a penalty parameter $\alpha$, and a given tolerance tol (Step 1) and runs the local NN-$K$ iterations [cf. (20)] for node $i$ until the local gradient norm $\|g_i\|$ becomes smaller than tol (Step 13). The descent iteration is implemented in Step 12. Implementing this descent requires access to the NN-$K$ descent direction $d_{i,t}^{(K)}$, which is computed by the loop in Steps 7-11. Step 7 initializes the loop by computing the NN-0 step $d_{i,t}^{(0)} = -D_{ii,t}^{-1} g_{i,t}$. The core of the loop is Step 10, which implements the recursion that refines the direction. Step 8 stands for the variable exchange that is necessary to implement Step 10. After $K$ iterations through this loop, the NN-$K$ descent direction $d_{i,t}^{(K)}$ is computed and can be used in Step 12. Both Steps 7 and 10 require access to the local gradient component $g_{i,t}$. This is evaluated in Step 6 after receiving the prerequisite information from neighbors in Step 5. Steps 3 and 4 compute the matrix blocks that are also necessary in Steps 7 and 10. This process is repeated until $\|g_i\| \le$ tol (Step 13). Notice, however, that if Algorithm 1 is called with a variable with $\|g_i\| \le$ tol we still run at least one iteration of NN-$K$.

ANN-$K$ calls Algorithm 1 in Step 2 of Algorithm 2. The factor $\alpha$ is subsequently reduced by the factor $\eta < 1$ as indicated in Step 6 of Algorithm 2, which implements the replacement $\alpha = \eta\alpha$. The rest of Algorithm 2 is designed to handle the fact that a small local gradient norm does not necessarily imply a small global gradient norm. To handle this possible mismatch, flag variables $s_{ij}$ are introduced at node $i$ to signal the fact that node $j$ has reached a local gradient $g_j$ with norm $\|g_j\| \le$ tol. Whenever node $i$ completes a run of Algorithm 1 it broadcasts the signal $s_{ii}$ to all other nodes (Step 3) and updates the variables $s_{ij}$ to $s_{ij} = 1$ for all the nodes that sent the signals $s_{jj} = 1$ while Algorithm 1 was executing (Step 4). If all the variables $s_{ij} = 1$ (Step 5) it must be that this is true for all nodes and it is thus safe to modify $\alpha$ (Step 6). The flag variables are reset to $s_{ij} = 0$ and Algorithm 1 is called with the reduced $\alpha$.
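The flag bookkeeping in Steps 3-7 of Algorithm 2 can be sketched as follows. This is an illustration only, not the authors' implementation: the class `Node`, the function `run_round`, and the synchronous round structure are assumptions made for the example, and the call to NN-$K$ in Step 2 is replaced by simply declaring which nodes have finished a local run.

```python
class Node:
    """Bookkeeping for one node; flags[j] plays the role of s_ij."""

    def __init__(self, index, n, alpha, eta):
        self.index = index
        self.alpha = alpha
        self.eta = eta
        self.flags = [0] * n

    def finish_local_run(self, nodes):
        # Step 3: set s_ii = 1 and broadcast it to all other nodes.
        self.flags[self.index] = 1
        for other in nodes:
            if other is not self:
                # Step 4 at the receiver: record the signal s_jj = 1.
                other.flags[self.index] = 1

    def maybe_reduce(self):
        # Steps 5-7: reduce alpha only once every flag has been raised,
        # then reset the flags for the next penalty level.
        if all(self.flags):
            self.alpha *= self.eta
            self.flags = [0] * len(self.flags)
            return True
        return False


def run_round(nodes, finished):
    """Nodes listed in `finished` complete a local run, then every node checks its flags."""
    for i in finished:
        nodes[i].finish_local_run(nodes)
    return [node.maybe_reduce() for node in nodes]
```

In this sketch no node reduces $\alpha$ while at least one node has not yet reported a small local gradient, which is the mismatch that the flags are meant to guard against.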

As is typical of penalty methods, there are tradeoffs in the selection of the initial value of $\alpha$ and the decrease factor $\eta$. Small values of the initial penalty parameter $\alpha$ and of the factor $\eta$ result in a sequence of approximate problems having solutions $y^*$ that are closer to the actual solution $\tilde{y}^*$. However, problems with small $\alpha$ may take a large number of iterations to converge if initialized far from the optimum value because the constant $\rho$ approaches 1 when $\alpha$ is small, as we discussed in Remark 1. It is therefore better to initialize Algorithm 2 with values of $\alpha$ that are not too small and to decrease $\alpha$ by a factor $\eta$ that is not too aggressive. We discuss these tradeoffs in the numerical examples of Section V-A.


V. Numerical analysis 


We compare the performance of DGD and different versions of network Newton in the minimization of a distributed quadratic objective. The comparison is done in terms of both the number of iterations and the number of information exchanges. For each agent $i$ we consider a positive definite diagonal matrix $A_i \in \mathbb{S}_p^{++}$ and a vector $b_i \in \mathbb{R}^p$ to define the local objective function $f_i(x) := (1/2) x^T A_i x + b_i^T x$. Therefore, the global cost function $f(x)$ is written as

$$f(x) := \sum_{i=1}^{n} \frac{1}{2} x^T A_i x + b_i^T x. \qquad (43)$$

The difficulty of solving (43) is given by the condition number of the matrices $A_i$. To adjust condition numbers we generate diagonal matrices $A_i$ with random diagonal elements $a_{ii}$. The first $p/2$ diagonal elements $a_{ii}$ are drawn uniformly at random from the discrete set $\{1, 10^{-1}, \ldots, 10^{-\xi}\}$ and the next $p/2$ are drawn uniformly at random from the set $\{1, 10^{1}, \ldots, 10^{\xi}\}$. This choice of coefficients yields local matrices $A_i$ with eigenvalues in the interval $[10^{-\xi}, 10^{\xi}]$ and global matrices $\sum_{i=1}^{n} A_i$ with eigenvalues in the interval $[n 10^{-\xi}, n 10^{\xi}]$. The linear terms $b_i^T x$ are added so that the different local functions have different minima. The vectors $b_i$ are chosen uniformly at random from the box $[0,1]^p$.
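The generation procedure described above can be sketched as follows. The function names and the seed handling are hypothetical choices for this illustration; each diagonal matrix $A_i$ is stored as the list of its diagonal entries.

```python
import random

def make_quadratic_problem(n=100, p=4, xi=2, seed=0):
    """Return the diagonals of A_i and the vectors b_i as lists of lists."""
    rng = random.Random(seed)
    small = [10.0 ** (-k) for k in range(xi + 1)]  # {1, 10^-1, ..., 10^-xi}
    large = [10.0 ** k for k in range(xi + 1)]     # {1, 10^1, ..., 10^xi}
    A, b = [], []
    for _ in range(n):
        diag = [rng.choice(small) for _ in range(p // 2)]
        diag += [rng.choice(large) for _ in range(p - p // 2)]
        A.append(diag)
        b.append([rng.uniform(0.0, 1.0) for _ in range(p)])
    return A, b

def condition_number(A):
    """Condition number of the diagonal global matrix sum_i A_i."""
    totals = [sum(Ai[k] for Ai in A) for k in range(len(A[0]))]
    return max(totals) / min(totals)
```

Calling `condition_number(A)` on a generated instance gives the quantity that controls how hard the instance is for DGD.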

For the quadratic objective in (43) we can compute the optimal argument $x^*$ in closed form. We then evaluate convergence through the relative error that we define as the average normalized squared distance between local vectors $x_{i,t}$ and the optimal decision vector $x^*$,

$$e_t := \frac{1}{n} \sum_{i=1}^{n} \frac{\|x_{i,t} - x^*\|^2}{\|x^*\|^2}. \qquad (44)$$
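As a small illustration of (44), the sketch below computes the closed-form minimizer of (43) for diagonal matrices $A_i$, which is $x^* = -(\sum_i A_i)^{-1} \sum_i b_i$, and then evaluates the relative error for a list of local iterates. The function names are hypothetical and the matrices are stored as lists of diagonal entries.

```python
def optimal_argument(A, b):
    """Minimizer of sum_i (1/2) x^T A_i x + b_i^T x for diagonal A_i."""
    p = len(b[0])
    return [-sum(bi[k] for bi in b) / sum(Ai[k] for Ai in A) for k in range(p)]

def relative_error(X, x_star):
    """Average over nodes of ||x_i - x*||^2 / ||x*||^2, as in (44)."""
    denom = sum(s * s for s in x_star)
    total = 0.0
    for x in X:
        total += sum((xk - sk) ** 2 for xk, sk in zip(x, x_star)) / denom
    return total / len(X)
```

For instance, if every node holds the zero vector, the relative error equals 1, and it equals 0 when every node holds $x^*$.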


The network connecting the nodes is a $d$-regular cycle where each node is connected to exactly $d$ neighbors and $d$ is assumed even. The graph is generated by creating a cycle and then connecting each node with the $d/2$ nodes that are closest in each direction. The diagonal weights in the matrix $W$ are set to $w_{ii} = 1/2 + 1/(2(d+1))$ and the off diagonal weights to $w_{ij} = 1/(2(d+1))$ when $j \in \mathcal{N}_i$.
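A minimal sketch of this construction is given below; the function name is hypothetical and the code assumes $n > d$ so that the $d$ neighbors of each node are distinct. With these weights each row sums to $1/2 + 1/(2(d+1)) + d/(2(d+1)) = 1$.

```python
def cycle_weights(n, d):
    """Weight matrix W of a d-regular cycle with n nodes (d even, n > d)."""
    off = 1.0 / (2 * (d + 1))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.5 + off
        # connect node i with the d/2 closest nodes in each direction
        for k in range(1, d // 2 + 1):
            W[i][(i + k) % n] = off
            W[i][(i - k) % n] = off
    return W
```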

Fig. 1: Convergence of DGD, NN-0, NN-1, and NN-2 in terms of number of iterations. The network Newton methods converge faster than DGD. Furthermore, the larger $K$ is, the faster NN-$K$ converges.

In the subsequent experiments we set the network size to $n = 100$, the dimension of the decision vectors to $p = 4$, the condition number parameter to $\xi = 2$, the penalty coefficient inverse to $\alpha$ = 10“^, and the network degree to $d = 4$. The network Newton step size is set to $\epsilon = 1$, which is always possible when we have quadratic objectives. Figure 1 illustrates a sample convergence path for DGD, NN-0, NN-1, and NN-2 by measuring the relative error $e_t$ in (44) with respect to the number of iterations $t$. As expected for a problem that does not have a small condition number (in this particular instantiation of the function in (43) the condition number is 95.2), the different versions of network Newton are much faster than DGD. For example, after $t$ = 1.5 x 10^ iterations the error associated with the DGD iterates is $e_t \approx$ 1.9 x 10“^. Comparable or better accuracy $e_t <$ 1.9 x 10“^ is achieved in $t = 132$, $t = 63$, and $t = 43$ iterations for NN-0, NN-1, and NN-2, respectively.

Further recall that $\alpha$ controls the difference between the actual optimal argument $\tilde{y}^* = [x^*; \ldots; x^*]$ and the argument $y^*$ to which DGD and network Newton converge. Since we have $\alpha$ = 10“^ and the difference between these two vectors is of order $O(\alpha)$, we expect the error in (44) to settle at a small but nonzero value. The error actually settles at $e_t \approx$ 6.3 x 10“^ and it takes all three versions of network Newton less than $t = 400$ iterations to do so. It takes DGD more than $t$ = 10"^ iterations to reach this value. This relative performance difference decreases if the problem has better conditioning but can be made arbitrarily large by increasing the condition number of the matrix $\sum_{i=1}^{n} A_i$. The number of iterations required for convergence can be further decreased by considering higher order approximations of the Newton step. The advantages would be misleading because they come at the cost of increasing the number of communications required to approximate the Newton step.

To study this latter effect we consider the relative performance of DGD and different versions of network Newton in terms of the number of local information exchanges. As pointed out in Remark 1, each iteration in NN-$K$ requires a total of $K + 1$ information exchanges with each neighbor, as opposed to the single variable exchange required by DGD. After $t$ iterations the number of variable exchanges between each pair of neighbors is $t$ for DGD and $(K+1)t$ for NN-$K$. Thus, we can translate Figure 1 into a path in terms of number of communications by scaling the time axis by $(K+1)$. The result of this scaling is shown in Figure 2. The different versions of network Newton retain a significant, albeit smaller, advantage with respect to DGD. Error $e_t <$ 10“^ is achieved by NN-0, NN-1, and NN-2 after $(K+1)t$ = 3.7 x 10^, $(K+1)t$ = 3.1 x 10^, and $(K+1)t$ = 3.4 x 10^ variable exchanges, respectively. When measured in this metric it is no longer true that increasing $K$ results in faster convergence. For this particular problem instance it is actually NN-1 that converges fastest in terms of number of communication exchanges.

Fig. 2: Convergence of DGD, NN-0, NN-1, and NN-2 in terms of number of communication exchanges. The NN-$K$ methods retain the advantage over DGD but increasing $K$ may not result in faster convergence. For this particular instance it is actually NN-1 that converges fastest in terms of number of communication exchanges.

Fig. 3: Histograms of the number of information exchanges required to achieve accuracy $e_t <$ 10“^. The qualitative observations made in Figures 1 and 2 hold over a range of random problem realizations.
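The rescaling from iterations to exchanges is a one-line computation. As a hedged illustration, the sketch below applies the factor $(K+1)$ to the iteration counts $t = 132$, $t = 63$, and $t = 43$ that the text reports for NN-0, NN-1, and NN-2 in the discussion of Figure 1; the function and variable names are hypothetical.

```python
def exchanges_per_neighbor(iterations, K=None):
    """Exchanges with each neighbor after `iterations`; K=None stands for DGD."""
    return iterations if K is None else (K + 1) * iterations

# Iteration counts reported for NN-0, NN-1, and NN-2 in the Figure 1 discussion.
reported_iterations = {0: 132, 1: 63, 2: 43}
exchanges = {K: exchanges_per_neighbor(t, K) for K, t in reported_iterations.items()}
best_K = min(exchanges, key=exchanges.get)
```

For that accuracy level the counts become 132, 126, and 129 exchanges, so NN-1 is again the cheapest in communication terms even though NN-2 needs the fewest iterations.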

For a more comprehensive evaluation we consider 10^ different random realizations of (43) where we also randomize the degree $d$ of the $d$-regular graph, which we choose from the even numbers in the set $[2, 10]$. The remaining parameters are the same used to generate Figures 1 and 2. For each joint random realization of network and objective we run DGD, NN-0, NN-1, and NN-2 until achieving error $e_t <$ 10“^ and record the number of communication exchanges that have elapsed, which amounts to simply $t$ for DGD and $(K+1)t$ for NN-$K$. The resulting histograms are shown in Figure 3. The mean times required to reduce the error to $e_t <$ 10“^ are 4.3 x 10^ for DGD and 4.0 x 10^, 3.5 x 10^, and 3.7 x 10^ for NN-0, NN-1, and NN-2. As in the particular case shown in Figures 1 and 2, NN-1 performs best in terms of communication exchanges. Observe, however, that the number of communication exchanges required by NN-2 is not much larger and that NN-2 requires less computational effort than NN-1 because the number of iterations $t$ is smaller.

Fig. 4: Convergence of adaptive DGD, NN-0, NN-1, and NN-2 for $\alpha_0$ = 10“^. Network Newton methods require fewer iterations than DGD.

Fig. 5: Convergence of adaptive DGD, NN-0, NN-1, and NN-2 for $\alpha_0$ = 10“^. ANN methods require fewer iterations than DGD and the convergence of all algorithms is faster relative to the case that $\alpha_0$ = 10 "^.


A. Adaptive Network Newton 

Given that DGD and network Newton are penalty methods, it is of interest to consider their behavior when the inverse penalty coefficient $\alpha$ is decreased recursively. The adaptation of $\alpha$ for NN-$K$ is discussed in Section IV, where it is termed adaptive (A)NN-$K$. The same adaptation strategy is considered here for DGD. The parameter $\alpha$ is kept constant until the local gradient components $g_{i,t}$ become smaller than a given tolerance tol, i.e., until $\|g_{i,t}\| \le$ tol for all $i$. When this tolerance is achieved, the parameter $\alpha$ is scaled by a factor $\eta < 1$, i.e., $\alpha$ is decreased from its current value to $\eta\alpha$. This requires the use of a signaling method like the one summarized in Algorithm 2 for ANN-$K$.

We consider the objective in (43) and nodes connected by a $d$-regular cycle. We use the same parameters used to generate Figures 1 and 2. The adaptive gradient tolerance is set to tol = 10“^ and the scaling parameter to $\eta = 0.1$. We consider two different scenarios where the initial penalty parameters are $\alpha = \alpha_0$ = 10”^ and $\alpha = \alpha_0$ = 10“^. The respective error trajectories $e_t$ with respect to the number of iterations are shown in Figure 4, where $\alpha_0$ = 10“^, and Figure 5, where $\alpha_0$ = 10“^. In each figure we show $e_t$ for adaptive DGD, ANN-0, ANN-1, and ANN-2. Both figures show that the ANN methods outperform adaptive DGD and that larger $K$ reduces the number of iterations that it takes ANN-$K$ to achieve a target error. These results are consistent with the findings summarized in the earlier figures for fixed $\alpha$.

More interesting conclusions follow from a comparison across Figures 4 and 5. We can see that it is better to start with the (larger) value $\alpha$ = 10“^ even if the method initially converges to a point farther from the actual optimum. This happens because increasing $\alpha$ decreases the constant $\rho = 2(1-\delta)/(2(1-\delta) + \alpha m)$.
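The dependence of $\rho$ on $\alpha$ is easy to tabulate. The sketch below evaluates the expression above for two values of $\alpha$; the values $\delta = 0.6$ (the diagonal weight of the $d = 4$ cycle) and $m = 1$ are assumptions chosen for illustration and are not taken from the experiments.

```python
def rho(alpha, delta, m):
    """Constant rho = 2(1 - delta) / (2(1 - delta) + alpha * m)."""
    return 2.0 * (1.0 - delta) / (2.0 * (1.0 - delta) + alpha * m)

# Hypothetical values: delta = 0.6 and m = 1.
rho_large_alpha = rho(0.1, 0.6, 1.0)   # 0.8 / 0.9
rho_small_alpha = rho(0.01, 0.6, 1.0)  # 0.8 / 0.81
```

With these numbers $\rho$ moves from about 0.89 to about 0.99 when $\alpha$ is reduced by a factor of ten, so each NN-$K$ iteration contracts less for the smaller $\alpha$.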


B. Logistic regression 


For a non-quadratic test we consider the application of network Newton for solving a logistic regression problem. In this problem we are given $q$ training samples that we distribute across $n$ distinct servers. Denote as $q_i$ the number of samples that are assigned to server $i$. Each of the training samples at node $i$ contains a feature vector $u_{il} \in \mathbb{R}^p$ and a class $v_{il} \in \{-1, 1\}$. The goal is to predict the probability $P(v = 1 \mid u)$ of having label $v = 1$ when given a feature vector $u$ whose class is not known. The logistic regression model assumes that this probability can be computed as $P(v = 1 \mid u) = 1/(1 + \exp(-u^T x))$ for a linear classifier $x$ that is computed based on the training samples. It follows from this model that the regularized maximum log likelihood estimate of the classifier $x$ given the training samples $(u_{il}, v_{il})$ for $l = 1, \ldots, q_i$ and $i = 1, \ldots, n$ is given by

$$x^* = \operatorname{argmin}_x f(x) := \operatorname{argmin}_x \frac{\lambda}{2}\|x\|^2 + \sum_{i=1}^{n} \sum_{l=1}^{q_i} \log\left[1 + \exp(-v_{il} u_{il}^T x)\right], \qquad (45)$$

where we defined the function $f(x)$ for future reference. The regularization term $(\lambda/2)\|x\|^2$ is added to reduce overfitting to the training set.

The optimization problem in (45) can be written in the form of the optimization problem in (1). To do so simply define the local objective functions $f_i$ as

$$f_i(x) = \frac{\lambda}{2n}\|x\|^2 + \sum_{l=1}^{q_i} \log\left[1 + \exp(-v_{il} u_{il}^T x)\right], \qquad (46)$$

and observe that given this definition we can write the objective in (45) as $f(x) = \sum_{i=1}^{n} f_i(x)$. We can then solve (45) in a distributed manner using DGD and NN-$K$ methods.
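The local objective in (46) and its gradient can be written out as follows. This is a minimal sketch with hypothetical function names: `U` is the list of feature vectors $u_{il}$ stored at node $i$, `v` is the list of labels $v_{il}$, and the logarithm is evaluated in a numerically stable form, which is an implementation choice that is not discussed in the paper.

```python
import math

def local_objective(x, U, v, lam, n):
    """f_i(x) = (lam / (2n)) ||x||^2 + sum_l log(1 + exp(-v_l u_l^T x))."""
    value = lam / (2.0 * n) * sum(xk * xk for xk in x)
    for u, label in zip(U, v):
        margin = label * sum(uk * xk for uk, xk in zip(u, x))
        # stable evaluation of log(1 + exp(-margin))
        value += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return value

def local_gradient(x, U, v, lam, n):
    """Gradient of f_i: (lam / n) x - sum_l v_l u_l / (1 + exp(v_l u_l^T x))."""
    grad = [lam / n * xk for xk in x]
    for u, label in zip(U, v):
        margin = label * sum(uk * xk for uk, xk in zip(u, x))
        if margin >= 0.0:
            e = math.exp(-margin)
            weight = e / (1.0 + e)          # equals 1 / (1 + exp(margin))
        else:
            weight = 1.0 / (1.0 + math.exp(margin))
        for k in range(len(x)):
            grad[k] -= label * u[k] * weight
    return grad
```

Summing `local_objective` over the $n$ nodes recovers $f(x)$ in (45), since each node carries a fraction $1/n$ of the regularization term.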

In our experiments we use a synthetic dataset where each component of the feature vector $u_{il}$ with label $v_{il} = 1$ is generated from a normal distribution with mean $\mu$ and standard deviation $\sigma_+$, while sample points with label $v_{il} = -1$ are generated with mean $-\mu$ and standard deviation $\sigma_-$. The network is a $d$-regular cycle. The diagonal weights in the matrix $W$ are set to $w_{ii} = 1/2 + 1/(2(d+1))$ and the off diagonal weights to $w_{ij} = 1/(2(d+1))$ when $j \in \mathcal{N}_i$. We set the feature vector dimension to $p = 10$, the number of training samples per node to $q_i = 50$, and the regularization parameter to $\lambda$ = 10“^. The number of nodes is $n = 100$ and the degree of the $d$-regular cycle is $d = 4$.

Fig. 6: Convergence of DGD and the NN-0, NN-1, and NN-2 network Newton methods for a linearly separable logistic regression.
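The synthetic data described above can be generated along the following lines. The text does not say how labels are assigned to samples, so drawing each label as $+1$ or $-1$ with equal probability is an assumption of this sketch, as are the function name and the use of a seeded `random.Random` instance.

```python
import random

def make_node_samples(q_i, p, mu, sigma_plus, sigma_minus, rng):
    """Draw q_i labeled samples for one node; returns (features, labels)."""
    U, v = [], []
    for _ in range(q_i):
        label = rng.choice([-1, 1])  # assumed: labels are equally likely
        mean, std = (mu, sigma_plus) if label == 1 else (-mu, sigma_minus)
        U.append([rng.gauss(mean, std) for _ in range(p)])
        v.append(label)
    return U, v

# Example call with the dimensions used in the experiments (p = 10, q_i = 50).
rng = random.Random(0)
U, v = make_node_samples(50, 10, 3.0, 1.0, 1.0, rng)
```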

We consider first a scenario in which the dataset is linearly separable. To generate a linearly separable dataset the mean is set to $\mu = 3$ and the standard deviations to $\sigma_+ = \sigma_- = 1$. Figure 6 illustrates the convergence path of the objective function $F(y)$ when the penalty parameter is $\alpha$ = 10 ^ and the network Newton step size is $\epsilon = 1$. The reduction in the number of iterations required to achieve convergence is a little more marked than in the quadratic example considered earlier. The objective function values $F(y_t)$ for NN-0, NN-1, and NN-2 after $t = 500$ iterations are below 1.6 x 10“"^, while for DGD the objective function value after the same number of iterations is $F(y_t)$ = 2.6 x 10“^. Conversely, achieving accuracy $F(y_t)$ = 2.6 x 10“^ for NN-0, NN-1, and NN-2 requires 68, 33, and 19 iterations, respectively, while DGD requires 500 iterations. Observe that for this example NN-2 performs better than NN-1 and NN-0 not only in the number of iterations but also in the number of variable exchanges required to achieve a target accuracy.

We also consider a case in which the dataset is not linearly separable. To generate this dataset we set the mean to $\mu = 2$ and the standard deviations to $\sigma_+ = \sigma_- = 2$. The penalty parameter is set to $\alpha$ = 10“^ and the network Newton step size to $\epsilon = 1$. The resulting objective trajectories $F(y_t)$ of DGD, NN-0, NN-1, and NN-2 are shown in Figure 7. The advantages of the network Newton methods relative to DGD are less pronounced but still significant. In this case we also observe that NN-2 performs best in terms of number of iterations and number of communication exchanges.

Fig. 7: Convergence of DGD and the NN-0, NN-1, and NN-2 network Newton methods for a non-linearly separable logistic regression.

VI. Conclusions

Network Newton is a decentralized approximation of Newton's method for solving decentralized optimization problems. This paper studied convergence properties and implementation details of this method. Network Newton approximates the Newton direction by truncating a Taylor series expansion of the exact Newton step. This procedure produces a class of algorithms identified by $K$, which is the number of Taylor series terms that network Newton uses for approximating the Newton step. The algorithm is called NN-$K$ when we keep $K$ terms of the Newton step Taylor series. Linear convergence of NN-$K$ is established in a companion paper. Here, we completed the convergence analysis of NN-$K$ by showing that the sequence of iterates generated by NN-$K$ has a quadratic convergence rate in a specific interval. A quadratic phase exists for all choices of $K$, but this phase can be made arbitrarily large by increasing $K$. The analysis presented here also shows that for the particular case of quadratic objective functions, the convergence rate of NN-$K$ is independent of the condition number of the objective function. This is in contrast to distributed gradient descent methods, which require more iterations for problems with larger condition number. Numerical analyses compared the performance of distributed gradient descent and NN-$K$ with different choices of $K$ for minimizing quadratic objectives with large condition numbers as well as the log likelihood objective of a logistic regression problem. In either case we observe that all NN-$K$ methods work faster than distributed gradient descent in terms of number of iterations and number of communications required to achieve convergence. Overall, the theoretical and numerical analyses in this paper show that NN-$K$ achieves the design goal of accelerating the convergence of distributed gradient descent methods.

We further analyzed a tradeoff in the selection of a penalty parameter that controls both the accuracy of the optimal objective computed by network Newton methods and the rate of convergence. We proposed an adaptive version of network Newton (ANN) that achieves exact convergence by executing network Newton with an increasing sequence of penalty coefficients $1/\alpha$. Numerical analyses of ANN show that it is best to initialize the parameter $\alpha$ at moderate values and to decrease it through moderate factors.

Appendix A
Proof of Lemma 2

To prove the result in Lemma 2 we first use the Fundamental Theorem of Calculus and the Lipschitz continuity of the Hessians $H_t := H(y_t) = \nabla^2 F(y_t)$ to prove the following lemma.

Lemma 3 Consider the NN-$K$ method as defined by the update in (17). If the Hessians $H(y)$ are Lipschitz continuous with parameter $\alpha L$, then

$$\|g_{t+1} - g_t - H_t(y_{t+1} - y_t)\| \le \frac{\alpha L}{2}\|y_{t+1} - y_t\|^2. \qquad (47)$$

Proof: Considering the definitions of the objective function gradient $g_t = g(y_t) = \nabla F(y_t)$ and Hessian $H_t = H(y_t) = \nabla^2 F(y_t)$, the Fundamental Theorem of Calculus implies that

$$g_{t+1} = g_t + \int_0^1 H(y_t + \omega(y_{t+1} - y_t))\,(y_{t+1} - y_t)\,d\omega. \qquad (48)$$

Adding and subtracting the integral $\int_0^1 H(y_t)(y_{t+1} - y_t)\,d\omega$ to the right hand side of (48) yields

$$g_{t+1} = g_t + \int_0^1 H(y_t)(y_{t+1} - y_t)\,d\omega + \int_0^1 \left[H(y_t + \omega(y_{t+1} - y_t)) - H(y_t)\right](y_{t+1} - y_t)\,d\omega. \qquad (49)$$

Note that $H(y_t)(y_{t+1} - y_t)$ is not a function of the variable $\omega$ and the first integral in (49) can be simplified as $\int_0^1 H(y_t)(y_{t+1} - y_t)\,d\omega = H(y_t)(y_{t+1} - y_t)$. By considering this simplification and regrouping the terms in (49) we obtain

$$g_{t+1} - g_t - H(y_t)(y_{t+1} - y_t) = \int_0^1 \left[H(y_t + \omega(y_{t+1} - y_t)) - H(y_t)\right](y_{t+1} - y_t)\,d\omega. \qquad (50)$$

We proceed by computing the norm of both sides of (50). Observing that for any vector $a$ the inequality $\|\int a\,d\omega\| \le \int \|a\|\,d\omega$ holds and considering that the product of norms is greater than or equal to the norm of the respective product, it follows that

$$\|g_{t+1} - g_t - H(y_t)(y_{t+1} - y_t)\| \le \int_0^1 \|H(y_t + \omega(y_{t+1} - y_t)) - H(y_t)\|\,\|y_{t+1} - y_t\|\,d\omega. \qquad (51)$$

The Hessians $H(y_t)$ are Lipschitz continuous with parameter $\alpha L$. Therefore, we can write

$$\|H(y_t + \omega(y_{t+1} - y_t)) - H(y_t)\| \le \alpha L \omega \|y_{t+1} - y_t\|. \qquad (52)$$

Substituting the upper bound in (52) for the term $\|H(y_t + \omega(y_{t+1} - y_t)) - H(y_t)\|$ in (51) leads to

$$\|g_{t+1} - g_t - H(y_t)(y_{t+1} - y_t)\| \le \int_0^1 \alpha L \omega \|y_{t+1} - y_t\|^2\,d\omega = \frac{\alpha L}{2}\|y_{t+1} - y_t\|^2, \qquad (53)$$

where the equality in (53) is valid since $\int_0^1 \omega\,d\omega = 1/2$. The inequality in (53) yields the result in (47) considering the notation $H_t = H(y_t)$. ■

Proof of Lemma 2: In this proof, to simplify the notation, we use $\hat{H}_t^{-1}$ to indicate the approximate Hessian inverse $\hat{H}_t^{(K)^{-1}}$. Recall the result in (47). Considering the update formula for NN-$K$ in (17), the term $y_{t+1} - y_t$ can be substituted by $-\epsilon \hat{H}_t^{-1} g_t$. Making this substitution into (47) implies that

$$\|g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t\| \le \frac{\epsilon^2 \alpha L}{2}\|\hat{H}_t^{-1} g_t\|^2. \qquad (54)$$

The definition of matrix norm implies that the norm of the product $D_t^{-1/2}(g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t)$ is bounded above as

$$\|D_t^{-1/2}(g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t)\| \le \|D_t^{-1/2}\|\,\|g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t\|. \qquad (55)$$

Substituting $\|g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t\|$ in the right hand side of (55) by the upper bound in (54) leads to

$$\|D_t^{-1/2}(g_{t+1} - g_t + \epsilon H_t \hat{H}_t^{-1} g_t)\| \le \frac{\epsilon^2 \alpha L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1} g_t\|^2. \qquad (56)$$

Observe that the triangle inequality states that for any vectors $a$ and $b$, and a positive constant $C$, if the relation $\|a - b\| \le C$ holds true, then $\|a\| \le \|b\| + C$. By setting $a := D_t^{-1/2} g_{t+1}$, $b := D_t^{-1/2}(g_t - \epsilon H_t \hat{H}_t^{-1} g_t)$ and $C := (\epsilon^2 \alpha L/2)\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1} g_t\|^2$, and considering the relation in (56), we obtain that the inequality $\|a - b\| \le C$ is satisfied. Therefore, $\|a\| \le \|b\| + C$ holds true, which is equivalent to

$$\|D_t^{-1/2} g_{t+1}\| \le \|D_t^{-1/2}(g_t - \epsilon H_t \hat{H}_t^{-1} g_t)\| + \frac{\epsilon^2 \alpha L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1} g_t\|^2. \qquad (57)$$

By rewriting the term $D_t^{-1/2} g_t$ as the sum $(1-\epsilon)(D_t^{-1/2} g_t) + \epsilon(D_t^{-1/2} g_t)$ and using the triangle inequality we can update the right hand side of (57) as

$$\|D_t^{-1/2} g_{t+1}\| \le (1-\epsilon)\|D_t^{-1/2} g_t\| + \epsilon\|D_t^{-1/2}(I - H_t \hat{H}_t^{-1}) g_t\| + \frac{\epsilon^2 \alpha L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1} g_t\|^2. \qquad (58)$$

Observe that the Hessian is decomposed as $H_t = D_t - B$ and the approximate Hessian inverse is given by $\hat{H}_t^{-1} := D_t^{-1/2} \sum_{k=0}^{K} (D_t^{-1/2} B D_t^{-1/2})^k D_t^{-1/2}$. Considering these two relations and using the telescopic cancelation we can show that $I - H_t \hat{H}_t^{-1} = (B D_t^{-1})^{K+1}$. This result is studied in more detail in Lemma 3 of the companion paper. Therefore, we can write

$$\|D_t^{-1/2}(I - H_t \hat{H}_t^{-1}) g_t\| = \|(D_t^{-1/2} B D_t^{-1/2})^{K+1} D_t^{-1/2} g_t\|. \qquad (59)$$

As established in an earlier proposition, the eigenvalues of the matrix $D_t^{-1/2} B D_t^{-1/2}$ are bounded by 0 and $\rho$. This observation in association with the symmetry of $D_t^{-1/2} B D_t^{-1/2}$ yields

$$\|(D_t^{-1/2} B D_t^{-1/2})^{K+1}\| \le \rho^{K+1}. \qquad (60)$$

The simplification in (59) and the upper bound in (60) guarantee that the norm $\|D_t^{-1/2}(I - H_t \hat{H}_t^{-1}) g_t\|$ is upper bounded by

$$\|D_t^{-1/2}(I - H_t \hat{H}_t^{-1}) g_t\| \le \rho^{K+1}\|D_t^{-1/2} g_t\|. \qquad (61)$$

Substituting the upper bound in (61) for the second summand in the right hand side of (58) implies that

$$\|D_t^{-1/2} g_{t+1}\| \le (1-\epsilon)\|D_t^{-1/2} g_t\| + \epsilon\rho^{K+1}\|D_t^{-1/2} g_t\| + \frac{\alpha \epsilon^2 L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1} g_t\|^2. \qquad (62)$$

By grouping the first two summands in (62) and using the inequality $\|\hat{H}_t^{-1} g_t\| \le \|\hat{H}_t^{-1}\|\,\|g_t\|$ we can write

$$\|D_t^{-1/2} g_{t+1}\| \le (1 - \epsilon + \epsilon\rho^{K+1})\|D_t^{-1/2} g_t\| + \frac{\alpha \epsilon^2 L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1}\|^2\|g_t\|^2. \qquad (63)$$

Now we proceed to find an upper bound for $\|D_t^{-1/2} g_t\|$ in terms of $\|D_{t-1}^{-1/2} g_t\|$ to set a recursive inequality for the sequence $\|D_{t-1}^{-1/2} g_t\|$. We first show that the norm of the difference $D_t^{-1} - D_{t-1}^{-1}$ is bounded above as

$$\|D_t^{-1} - D_{t-1}^{-1}\| \le \|D_t^{-1}\|\,\|D_t - D_{t-1}\|\,\|D_{t-1}^{-1}\|. \qquad (64)$$

Factoring $D_t^{-1}$ and $D_{t-1}^{-1}$ from the left and right sides of $D_t^{-1} - D_{t-1}^{-1}$, respectively, gives $D_t^{-1} - D_{t-1}^{-1} = D_t^{-1}(D_{t-1} - D_t)D_{t-1}^{-1}$, from which the relation in (64) follows. Observe that the eigenvalues of the matrices $D_t$ and $D_{t-1}$ are bounded below by $\alpha m + 2(1-\Delta)$. Consequently, the eigenvalues of the matrices $D_t^{-1}$ and $D_{t-1}^{-1}$ are bounded above by $1/(\alpha m + 2(1-\Delta))$. Therefore, we can update the upper bound in (64) as

$$\|D_t^{-1} - D_{t-1}^{-1}\| \le \frac{1}{(2(1-\Delta) + \alpha m)^2}\|D_t - D_{t-1}\|. \qquad (65)$$

The next step is to show that the block diagonal matrices $D_t$ are Lipschitz continuous with parameter $\alpha L$. Notice that the difference $D_t - D_{t-1}$ can be simplified as $\alpha(G_t - G_{t-1})$. Moreover, the difference of two consecutive Hessians can be simplified as $H_t - H_{t-1} = \alpha(G_t - G_{t-1})$. Therefore, we obtain that $D_t - D_{t-1} = H_t - H_{t-1}$. This observation in association with the Lipschitz continuity of the Hessians with parameter $\alpha L$, i.e., $\|H_t - H_{t-1}\| \le \alpha L\|y_t - y_{t-1}\|$, implies that

$$\|D_t - D_{t-1}\| \le \alpha L\|y_t - y_{t-1}\|. \qquad (66)$$

By substituting the upper bound in (66) for $\|D_t - D_{t-1}\|$ in (65) we obtain that

$$\|D_t^{-1} - D_{t-1}^{-1}\| \le \frac{\alpha L}{(2(1-\Delta) + \alpha m)^2}\|y_t - y_{t-1}\|. \qquad (67)$$

Note that the absolute value of the inner product $|g_t^T(D_t^{-1} - D_{t-1}^{-1})g_t|$ is bounded above by the product $\|D_t^{-1} - D_{t-1}^{-1}\|\,\|g_t\|^2$. Considering the upper bound for $\|D_t^{-1} - D_{t-1}^{-1}\|$ in (67), the term $|g_t^T(D_t^{-1} - D_{t-1}^{-1})g_t|$ is bounded above by

$$|g_t^T(D_t^{-1} - D_{t-1}^{-1})g_t| \le \frac{\alpha L\|y_t - y_{t-1}\|}{(2(1-\Delta) + \alpha m)^2}\|g_t\|^2. \qquad (68)$$

Considering the triangle inequality, and observing the simplifications $|g_t^T D_{t-1}^{-1} g_t| = \|D_{t-1}^{-1/2} g_t\|^2$ and $|g_t^T D_t^{-1} g_t| = \|D_t^{-1/2} g_t\|^2$, we can rewrite (68) as

$$\|D_t^{-1/2} g_t\|^2 \le \|D_{t-1}^{-1/2} g_t\|^2 + \frac{\alpha L\|y_t - y_{t-1}\|\,\|g_t\|^2}{(2(1-\Delta) + \alpha m)^2}. \qquad (69)$$

Observe that for any three constants $a$, $b$ and $c$, if the relation $a^2 \le b^2 + c^2$ holds, then the inequality $|a| \le |b| + |c|$ is valid. By setting $a := \|D_t^{-1/2} g_t\|$, $b := \|D_{t-1}^{-1/2} g_t\|$, and $c := (\alpha L\|y_t - y_{t-1}\|)^{1/2}\|g_t\|/(2(1-\Delta) + \alpha m)$, we realize that $a^2 \le b^2 + c^2$ holds true according to (69). Hence, the relation $|a| \le |b| + |c|$ holds and we obtain that

$$\|D_t^{-1/2} g_t\| \le \|D_{t-1}^{-1/2} g_t\| + \frac{(\alpha L\|y_t - y_{t-1}\|)^{1/2}\|g_t\|}{2(1-\Delta) + \alpha m}. \qquad (70)$$

Considering the update of NN-$K$ in (17) we can substitute $y_t - y_{t-1}$ by $-\epsilon \hat{H}_{t-1}^{-1} g_{t-1}$. Applying this substitution into (70) implies that

$$\|D_t^{-1/2} g_t\| \le \|D_{t-1}^{-1/2} g_t\| + \frac{(\alpha \epsilon L\|\hat{H}_{t-1}^{-1} g_{t-1}\|)^{1/2}}{2(1-\Delta) + \alpha m}\|g_t\|. \qquad (71)$$

If we substitute $\|D_t^{-1/2} g_t\|$ by the upper bound in (71) and substitute $\|\hat{H}_{t-1}^{-1} g_{t-1}\|$ by the upper bound $\|\hat{H}_{t-1}^{-1}\|\,\|g_{t-1}\|$, the inequality in (63) can be written as

$$\|D_t^{-1/2} g_{t+1}\| \le (1 - \epsilon + \epsilon\rho^{K+1})\|D_{t-1}^{-1/2} g_t\| + (1 - \epsilon + \epsilon\rho^{K+1})\frac{(\alpha \epsilon L\|\hat{H}_{t-1}^{-1}\|\,\|g_{t-1}\|)^{1/2}}{2(1-\Delta) + \alpha m}\|g_t\| + \frac{\alpha \epsilon^2 L}{2}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1}\|^2\|g_t\|^2. \qquad (72)$$

Due to the fact that for a positive definite matrix the norm of its product by a vector is always at least its minimum eigenvalue multiplied by the norm of the vector, we can write $\mu_{\min}(D_{t-1}^{-1/2})\|g_t\| \le \|D_{t-1}^{-1/2} g_t\|$. Rearranging the terms yields

$$\|g_t\| \le \frac{1}{\mu_{\min}(D_{t-1}^{-1/2})}\|D_{t-1}^{-1/2} g_t\|. \qquad (73)$$

Note that the eigenvalues of the matrix $D_{t-1}$ are upper bounded by $2(1-\delta) + \alpha M$. Hence, $1/(2(1-\delta) + \alpha M)^{1/2}$ is a lower bound for the eigenvalues of the matrix $D_{t-1}^{-1/2}$. This observation implies that the upper bound in (73) for the norm $\|g_t\|$ can be updated as

$$\|g_t\| \le (2(1-\delta) + \alpha M)^{1/2}\|D_{t-1}^{-1/2} g_t\|. \qquad (74)$$

Substituting $\|g_t\|$ by the upper bound in (74) and considering the definition $\lambda := 1/(2(1-\delta) + \alpha M)$, it follows that we can update the right hand side of (72) as

$$\|D_t^{-1/2} g_{t+1}\| \le (1 - \epsilon + \epsilon\rho^{K+1})\|D_{t-1}^{-1/2} g_t\| + (1 - \epsilon + \epsilon\rho^{K+1})\frac{(\alpha \epsilon L\|g_{t-1}\|\,\|\hat{H}_{t-1}^{-1}\|)^{1/2}}{\lambda^{1/2}(2(1-\Delta) + \alpha m)}\|D_{t-1}^{-1/2} g_t\| + \frac{\alpha \epsilon^2 L}{2\lambda}\|D_t^{-1/2}\|\,\|\hat{H}_t^{-1}\|^2\|D_{t-1}^{-1/2} g_t\|^2. \qquad (75)$$

Observe that the norms $\|\hat{H}_t^{-1}\|$ and $\|\hat{H}_{t-1}^{-1}\|$ are upper bounded by $\Lambda$, which is defined in (27). Moreover, the norm $\|D_t^{-1/2}\|$ is bounded above by $1/(2(1-\Delta) + \alpha m)^{1/2}$. Substituting these bounds into (75) results in

$$\|D_t^{-1/2} g_{t+1}\| \le (1 - \epsilon + \epsilon\rho^{K+1})\left(1 + C_1\|g_{t-1}\|^{1/2}\right)\|D_{t-1}^{-1/2} g_t\| + \frac{\alpha \epsilon^2 L \Lambda^2}{2\lambda(2(1-\Delta) + \alpha m)^{1/2}}\|D_{t-1}^{-1/2} g_t\|^2, \qquad (76)$$

where $C_1$ is defined as

$$C_1 := \left[\frac{\alpha \epsilon L \Lambda}{\lambda(2(1-\Delta) + \alpha m)^2}\right]^{1/2}. \qquad (77)$$
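For reference, the telescopic cancelation invoked before (59) can be written out in a few lines. This is a worked sketch that uses only the decomposition $H_t = D_t - B$ and the truncated series for the approximate Hessian inverse; the hat notation for the approximate inverse is the one assumed in this reconstruction of the proof.

```latex
\begin{align*}
\hat{H}_t^{-1}
  &= D_t^{-1/2} \sum_{k=0}^{K} \left(D_t^{-1/2} B D_t^{-1/2}\right)^{k} D_t^{-1/2}
   = D_t^{-1} \sum_{k=0}^{K} \left(B D_t^{-1}\right)^{k}, \\
H_t \hat{H}_t^{-1}
  &= (D_t - B) D_t^{-1} \sum_{k=0}^{K} \left(B D_t^{-1}\right)^{k}
   = \left(I - B D_t^{-1}\right) \sum_{k=0}^{K} \left(B D_t^{-1}\right)^{k}
   = I - \left(B D_t^{-1}\right)^{K+1}, \\
D_t^{-1/2}\left(I - H_t \hat{H}_t^{-1}\right)
  &= D_t^{-1/2} \left(B D_t^{-1}\right)^{K+1}
   = \left(D_t^{-1/2} B D_t^{-1/2}\right)^{K+1} D_t^{-1/2}.
\end{align*}
```

The last line is the identity behind (59), and taking norms with the eigenvalue bound $\rho$ gives (61).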

The next step is to establish an upper bound for $\|g_{t-1}\|^{1/2}$ in terms of the objective function error $F(y_{t-1}) - F(y^*)$. Observe that the eigenvalues of the Hessian are bounded above by $2(1-\delta) + \alpha M$. This bound in association with the Taylor expansion of the objective function $F(y)$ around $\hat{y}$ leads to

$$F(y) \le F(\hat{y}) + \nabla F(\hat{y})^T(y - \hat{y}) + \frac{2(1-\delta) + \alpha M}{2}\|y - \hat{y}\|^2. \qquad (78)$$

According to the definition of $\lambda$ we can substitute $1/(2(1-\delta) + \alpha M)$ by $\lambda$. Applying this substitution into (78) and minimizing both sides of (78) with respect to $y$ yields

$$F(y^*) \le F(\hat{y}) - \lambda\|\nabla F(\hat{y})\|^2. \qquad (79)$$

Since (79) holds for any $\hat{y}$, we set $\hat{y} := y_{t-1}$. By rearranging the terms and taking their square roots, we obtain an upper bound for the gradient norm $\|\nabla F(y_{t-1})\| = \|g_{t-1}\|$ as

$$\|g_{t-1}\| \le \left[\frac{1}{\lambda}\left(F(y_{t-1}) - F(y^*)\right)\right]^{1/2}. \qquad (80)$$

The linear convergence of the objective function error implies that $F(y_{t-1}) - F(y^*) \le (1-\zeta)^{t-1}(F(y_0) - F(y^*))$; see Theorem 1. Considering this inequality and the relation in (80) we can write

$$\|g_{t-1}\|^2 \le \frac{(1-\zeta)^{t-1}}{\lambda}\left(F(y_0) - F(y^*)\right). \qquad (81)$$

The upper bound for the squared norm $\|g_{t-1}\|^2$ in (81) shows that $\|g_{t-1}\|^{1/2}$ is upper bounded by

$$\|g_{t-1}\|^{1/2} \le \left[\frac{(1-\zeta)^{t-1}}{\lambda}\left(F(y_0) - F(y^*)\right)\right]^{1/4}. \qquad (82)$$

By considering the definition of $\Gamma_2$ in (32) and substituting the upper bound in (82) for $\|g_{t-1}\|^{1/2}$, we can update the right hand side of (76) as

$$\|D_t^{-1/2} g_{t+1}\| \le (1 - \epsilon + \epsilon\rho^{K+1})\left[1 + C_2(1-\zeta)^{(t-1)/4}\right]\|D_{t-1}^{-1/2} g_t\| + \epsilon^2 \Gamma_2\|D_{t-1}^{-1/2} g_t\|^2, \qquad (83)$$

where $C_2 := C_1[(F(y_0) - F(y^*))/\lambda]^{1/4}$. Considering the definition of $C_1$ in (77), $C_2$ is given by

$$C_2 = \frac{(\alpha \epsilon L \Lambda)^{1/2}\left(F(y_0) - F(y^*)\right)^{1/4}}{\lambda^{3/4}(2(1-\Delta) + \alpha m)}. \qquad (84)$$

The explicit expression for $C_2$ in (84) and the definition of $\Gamma_1$ in (32) show that $C_2 = \Gamma_1$. This observation in association with (83) leads to the claim in (31).